PREFACE


The present century has seen extensive developments in methods of 
measuring scientifically the traits and abilities of human beings. 
Three main influences may be discovered behind this movement.
First, there is the tremendous interest which people have always 
shown in judging one another’s qualities. Much of their conversation 
consists in summing up their acquaintances. Their plans, either for 
work or play, are very largely guided by their notions regarding the 
persons with whom they are about to deal. It is obvious that such
notions are frequently biased or inaccurate, and yet each one con- 
tinues to rely, in great affairs or small, chiefly on his own insight into 
the characters of others. However, it is recognized that in certain 
spheres of human activity, more accurate and impartial assessments
are essential. The examination system, now firmly established in 
schools, universities, and in many professions, is supposed to meas- 
ure the achievements and capacities of pupils, of students, and of 
candidates for various jobs. Although this system is certainly an
advance on the earlier system of personal judgments, yet much 
doubt exists as to its efficiency. A considerable section of this book
is therefore devoted to a discussion of the defects of examinations, 
and of present-day methods of marking pupils’ or students’ work, 
and to a study of how these may be improved. 

Secondly, mankind is interested, not only in individuals, but also 
in general problems of human nature and human institutions. Thus 
another field where accurate measures of human qualities are needed 
is the scientific investigation of mind, behaviour, and society. Loose 
thinking and prejudice are extremely prevalent in matters of politics,
economics, religion, crime, and the like, as is shown, for example,
by Thouless in his book Straight and Crooked Thinking (1930). Yet 
the social sciences, such as psychology and sociology, are endeavour- 
ing to study these subjects rationally and impartially, in the same 
manner as the physicist studies the atom, or the physiologist the 
workings of the body. Now the data from which a science is built 
up are very largely numerical and quantitative. The records of a 
chemical or physical experiment usually consist of measures of
weights, times, distances, galvanometer or thermometer readings. 
Measurement is almost equally important for purposes of precise
description and experiment in the social sciences; and the deduction
of sound conclusions from such quantitative data frequently involves 
the application of statistical treatment. In this connection, then, we 
shall consider below the nature of measurement in psychology, and 
the elements of statistical technique, with especial reference to the 
field of education. 

A third source of the mental testing movement lies in the desire 
for human betterment. Medical science has advanced enormously in 
its ability to diagnose and treat bodily ills, but psychiatry and psy- 
chology are only beginning to learn how to handle mental abnor- 
malities. Yet there are already, in addition to the mental hospitals, 
numerous clinics where the diagnosis and treatment of emotionally 
maladjusted patients (adults and children) are carried out. Special
schools, and tutorial classes in some ordinary schools, play their
part by providing a type of education suited to children who are too
greatly handicapped, mentally or physically, to make progress under 
ordinary school conditions. Industrial schools and the Borstal system 
aim at re-educating delinquents and budding criminals, so as to turn 
them into useful members of society. The staffs of all these institu- 
tions need accurate methods of measuring the educational and 
occupational capacities of the ‘patients.’ Actually many of our most 
valuable mental tests were first constructed by workers in the fields 
of educational, emotional, and vocational guidance. 

This book is thus intended primarily for three classes of students, 
namely teachers and examiners, psychologists and others who set
out to investigate human phenomena from the scientific standpoint, 
and doctors and psychologists who wish to make practical use of 
tests in clinics and schools. For reasons of space, no attempt is made 
to describe the various mental tests in detail, but a classified list of
those used in this country is included in Chap. IX. Many authors 
have emphasized the value of tests without sufficiently drawing 
attention to their dangers and difficulties. Thus the main object of
this book is to outline the principles of test construction and application
or procedure, and the interpretation of the results of tests and
examinations, ignorance of which on the part of testers has frequently
led to serious errors. Mental measurement undoubtedly needs
a high degree of skill, which few of the persons just mentioned have
either the time or the opportunity to acquire thoroughly, before they
set out to test or examine.
There are also available many excellent textbooks on statistical
methods, perhaps the most useful to testers being those of Garrett
(1953), Dawson (1933), and Guilford (1936). Yet in the present
writer’s experience, all of these are too difficult for the majority of
students whose psychological training only amounts to between 
seventy-five and one hundred and fifty hours. This book aims, 
therefore, to describe the minimum essentials, to show their imme- 
diate application to psychological (and especially to educational) 
problems, and to stress their general significance rather than their 
technical details. 

More advanced students are unlikely to find within much that is 
original. But, so far as the writer knows, no previous account of the 
statistics of marking, nor of the construction of new-type examina- 
tions, has been published by a British author. The analysis of educa- 
tional measurement and the discussion of the technique of examining 
are partly new but owe a great deal to Hamilton’s brilliant book,
The Art of Interrogation (1929). Possibly also some of the practical
hints about tests and testing have not been published before. 

Lack of space has necessitated some important omissions, such as 
the whole field of temperament and character assessment, the testing 
of special (vocational) aptitudes, and of sensory-motor capacities. 
Actually few of the tests falling under these headings are well stan- 
dardized, and fewer still can safely be applied by those who are not 
trained psychologists. 

P. E. V. 

The University 
Glasgow 
July 1939 



PREFACE TO THE SECOND EDITION 

The general principles of mental testing and elementary statistical
analysis have changed but little in the sixteen years since this book
was written. But the educational context in which they were originally
set has altered considerably, notably with the passing of the 1944
Education Act. This edition, then, brings the book up to date in
respect of selection for secondary education, though it is still, of
course, concerned only with the critical study of methods, not with
the selection problem as a whole. Next, there have been important
developments in the field of intelligence theory and intelligence
testing, and it is hoped that the views put forward here represent a
more reasonable picture of the relations of innate and acquired
abilities and of the findings of factorial analysis than has hitherto
been available to education and psychology students. The comprehensive
list of published mental tests has been completely reconstructed.
Several of the statistical sections have been revised and
slightly expanded to cover more of the problems which the examiner,
mental tester, or research worker are likely to meet, while avoiding
as far as possible any increase in complexity. Thus in general the
book covers the same ground as the original edition, at much the
same length, but in a more up-to-date and more accurate manner.

P. E. V.

Institute of Education
University of London
1955
CHAPTER I


INTRODUCTION TO STATISTICAL METHODS 

It is an unfortunate, yet indubitable, fact that the majority of 
students of psychology and education find extreme difficulty in 
grasping and applying the statistical concepts which are an essential 
prerequisite to sound mental measurement. They take fright at the
mere mention of statistics or at the sight of statistical symbols and
formulae. A small minority, on the other hand, find in the manipulations
of numbers which we call statistics an intense fascination,
closely akin to the musical composer’s fascination with interweavings
of sounds. Probably statistical ability is as specialized and as rare as
is ability in musical composition, and the expert statistician rather
easily loses touch with the doubts and difficulties of the beginner.
He fails to realize that the statistical way of looking at educational
or psychological phenomena, which comes so naturally to him, is
at first entirely foreign to the average student. Yet the student’s
‘phobia’ is certainly unnecessary, since the elementary statistics
needed for the proper treatment of marks and test scores includes 
little more than the kind of algebra and arithmetic which he mastered 
by the age of 14. Statistics means nothing more than numerical data, 
such as measurements, examination and test results; and statistical 
methods are merely the ways of dealing with, or organizing, these 
data so as to bring out their full significance. 

The raison d’être of statistical methods is that human beings are
apt to differ so widely from one another in any and every measurable 
characteristic, that results or conclusions which apply to one person 
or one group of persons (e.g. one school class), do not necessarily apply 
to another person or group. First, then, we must have efficient techniques
of arranging or tabulating the scores and marks obtained from
different people, in order to see what is the general run of their scores, and
what is the range or extent of the differences. These techniques are
described in the present chapter. The ‘general run’ and the ‘range or
extent’ of scores next require more precise definition and measurement;
that is to say, we must know how to work out averages and indices 
of what is called spread, scatter, dispersion, or variability (Chap. III). 
Not only do human beings differ from one another, but also each
individual varies in his capacities and characteristics from time to 
time. The score he achieves to-day may or may not be typical of his 
usual accomplishments. The same holds true of a class of pupils or 
students. The extent of this untrustworthiness or unreliability (as it 
is generally termed) of a single measurement or of a set of measure- 
ments therefore needs to be determined. Statistical treatment will tell 
us whether any alteration or variation in scores represents a real and 
important change — such a change, for example, as we might hope to 
bring about by means of the teaching given to the individual or to 
the class — or whether it is merely a matter of chance fluctuations. 
A further very common and important problem is the reality or 
significance of a difference between two groups of people. For
instance, girls may appear to be better than boys at reading; but 
this is not true of all girls and all boys; so we must have a method of
establishing how general and how reliable is the difference (Chap. V). 

Still another type of variation requires special techniques of
investigation, namely, the variations in people’s capacities in differ- 
ent spheres. It is obvious that no one is even all-round in his level 
of abilities. Yet we are apt to assume that high achievement in one 
field implies high achievement in other fields. Sometimes also we infer 
the opposite type of relationship— that if a person is strong in some 
trait, then he will be lacking in some other trait. The statistical 
methods of correlation enable us to say how far various traits or 
abilities ‘go together’ or are related to one another. They also tell us
how far our tests or examinations measure what they are supposed
to measure, i.e. how valid they are. For example, a Certificate or 
Matriculation examination at the age of 16-18 may, or may not, 
validly predict success in a subsequent university career (Chaps. 
VI-VIII). 

Although statistical methods are, then, of undoubted importance, 
their limitations should also be recognized at this stage in our 
discussion. They are tools for handling numerical data, but they are 
not capable of improving data which may be initially inaccurate or 
biased. When an investigator uses a poor test or examination, or 
when he observes and records people’s behaviour or mental processes 
wrongly, no amount of statistical treatment of his results can correct 
such deficiencies. Further, the fundamentally abstract nature of statis- 
tical methods needs to be stressed. They are adapted primarily for 
mass-investigations of measurements obtained from large numbers 
of persons, and such investigations are apt to lose touch with the
concrete individual case. Without them we could not make sound 
generalizations about human beings, but even with them we cannot 
usually state whether a generalization applies to a particular human 
being. The practical handling and understanding of an actual child 
or adult is still an art rather than a science, though it may be greatly 
assisted by scientific mental tests and guided by the results of objec- 
tive psychological experiments and statistical studies. 

TABULATION OF MEASURES 

Measures and Variables . — By a variable we shall mean an ability 
or some other characteristic in respect of which human beings vary 
or differ from one another. Height, age, income, intelligence,
examination success, are all variables in this sense. A measure is the 
numerical expression of a certain standing on, or amount of, a 
variable. Fifty inches of height, five pounds a week income, and a 
mark of 6 out of 10 are examples of measures.

Most variables are continuous; that is to say, they can vary by
infinitely small gradations. But we only measure them to the nearest
convenient unit. For example, heights may be measured to the nearest
tenth of an inch. But actually 50.1 inches includes all heights from
50.05 to 50.15; 50.2 inches includes from 50.15 to 50.25, and so on.
Similarly, a school mark of 6 is usually an approximation representing
all degrees of quality from 5½ to 6½. If, however, half-marks are
awarded, then 6 represents 5¾ to 6¼, and 6½ represents 6¼ to 6¾, and
so on. But some variables are discrete or discontinuous, since they
only vary by units. For instance, if the size of a class of pupils is 40,
this does not imply between 39½ and 40½ pupils. The scores for many
tests, and some school marks, are of this type, for they represent the 
exact numbers of questions or test items correctly answered. Thus a 
pupil would score 6 out of 10 for arithmetic sums even if he had 
nearly completed a seventh sum. (Such measures are referred to in 
the next chapter as count or enumeration scores.) 
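The rule just stated, that a continuous measure recorded to the nearest unit stands for a whole interval of values, can be put as a short sketch. The function name `true_limits` is an illustrative assumption and does not come from the text; exact fractions are used so that halves and quarters are not blurred by rounding.

```python
from fractions import Fraction

def true_limits(recorded, unit):
    """Return the (lower, upper) limits of the interval that a continuous
    measure stands for when it is recorded to the nearest `unit`.
    Hypothetical helper, not part of the original text."""
    value = Fraction(str(recorded))
    half_unit = Fraction(str(unit)) / 2
    return value - half_unit, value + half_unit

# Height recorded as 50.1 inches, to the nearest tenth of an inch.
low, high = true_limits("50.1", "0.1")
print(float(low), float(high))

# A school mark of 6 when only whole marks are awarded ...
print(true_limits(6, 1))
# ... and when half-marks are awarded.
print(true_limits(6, "0.5"))
```

The sketch applies only to continuous variables. A discrete count, such as a class of 40 pupils, has no such interval around it.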

This distinction between continuous and discrete variables needs 
to be borne in mind when tabulating measures (cf. p. 5). 

Tabulation . — Tabulation means arranging a set of measures in an
orderly fashion, so as to convey the essential facts about them in 
convenient and relatively brief form. If we take a class of pupils, and 
beside each pupil’s name write his age, or exam mark, we get a list,
but it is not a table, since the measures are quite haphazardly
arranged. It is very difficult to see from such a list, if it be a long one,
what is the general run and range of measures, or whether any
particular measure (e.g. John Smith’s 6 out of 10) is high, low, or
average. 

Ranking . — One way of tabulating is to rearrange all the measures 
in order of size from highest to lowest; that is, to put them in rank 
order. The desired information can then be grasped much more
readily. But this is a laborious process if the number of pupils is 
large. 

Frequency Distributions . — Under such circumstances it is better to
write in a column each possible measure, in order, and beside it the 
frequency, F; that is, the number of persons who obtain each 
measure. The result is called a frequency distribution, since it shows 
how the various measures are allotted or distributed among the 
persons. 

Ex. 1 . — The following is a list of the marks (out of 20) which
were awarded to a class of 50 pupils. The table below shows their
frequency distribution. Marks: 10, 7, 5, 17, 15, 13, 4, 6, 9, 15, 16,
20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6,
20, 15, 18, 17, 13, 11, 4, 16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2.


Measures   Tallies      F
   20      //           2
   19      //           2
   18      ///          3
   17      ///// //     7
   16      ////         4
   15      ///// /      6
   14      ///          3
   13      ////         4
   12      //           2
   11      //           2
   10      //           2
    9      ///          3
    8      //           2
    7      /            1
    6      //           2
    5      /            1
    4      //           2
    3                   0
    2      /            1
    1      /            1

                     N=50
Hints on Drawing Up a Frequency Table . — Do not look through
the original list and pick out all who got the highest measure, e.g. 20,
then all who got the next highest, 19, and so on. This would be a
waste of time and would easily lead to mistakes. Instead go through 
the measures in the list consecutively, and as each one is reached, put 
a tally or check mark opposite the appropriate measure in the 
column. When any one measure recurs for the fifth time, draw a line 
through the first four tallies. Repeat this on reaching the tenth, 
fifteenth, etc., tally. Add up the tallies to obtain the F column. Add 
up the F column itself to make sure that it checks with N, the total 
number of persons. 
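The tally procedure can be followed mechanically, and a minimal sketch is given here using the marks of Ex. 1. The names `MARKS` and `frequency_table` are illustrative assumptions. The list is gone through once, in the order given, and the F column is then checked against N.

```python
from collections import Counter

# Marks (out of 20) of the 50 pupils in Ex. 1, in the order listed.
MARKS = [10, 7, 5, 17, 15, 13, 4, 6, 9, 15, 16,
         20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6,
         20, 15, 18, 17, 13, 11, 4, 16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2]

def frequency_table(measures, lowest, highest):
    """Tally the measures in a single pass and return (measure, F) rows
    from the highest possible measure down to the lowest."""
    tally = Counter()
    for m in measures:          # one tally per measure, as each is reached
        tally[m] += 1
    return [(m, tally[m]) for m in range(highest, lowest - 1, -1)]

table = frequency_table(MARKS, 1, 20)

# Check that the F column adds up to N, the total number of persons.
assert sum(f for _, f in table) == len(MARKS)

for measure, f in table:
    print(f"{measure:>2}  {'/' * f:<7}  {f}")
```

A measure which nobody obtained, such as 3 in this example, still receives a row with a frequency of 0, as in the printed table.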

Distributions of Classified Measures . — The above procedure is still 
likely to produce an over-lengthy table, especially when N is large. 
It would be absurd, for example, to tabulate the heights of all the 
pupils of a school in this manner. Instead of listing the frequencies of
children 40.0, 40.1, 40.2 . . . inches tall, we would give the totals
with heights of 40, 41, 42 . . . inches, or the totals with heights of
40 upwards, 42 upwards, 44 upwards, and so on. That is, we would
group together sets of contiguous measures under a series of classes 
or categories. 

Considerable care is needed in deciding upon what is, or is not, 
included in a class. When the variable is discrete (for example, test
scores ranging from 10 to 93) we may call our classes 10-14, 15-19,
20-24, etc., or else 10+, 15+, 20+, etc. Both of these indicate
clearly that all scores of 10, 11, 12, 13, and 14 are included in the
first class, and that the size of each class is five units. The same
methods may be used for school marks. For instance, 10%-14%,
15%-19% ... is unambiguous, although, strictly speaking, the
first class includes all shades of merit from 9½% to 14½%. But with
some continuous variables such as age or height, difficulties often
arise because the original measures are themselves only approximations
which represent classes of measures (cf. p. 3). Thus if height
has been measured to the nearest tenth of an inch, and is tabulated
to the nearest inch, we should avoid calling our classes 40, 41, 42
. . ., etc. For 40 might include 39.95 to 40.95, or 39.45 to 40.45, or
39.55 to 40.55. If we write 40+, 41+, etc., the first of these is usually
intended. With ages, 5+, 6+ ... is quite clear, since the 5+
class includes all children who are at, or who have passed, their fifth
birthday, but have not yet reached their sixth.

No set rules can be laid down as to the choice of classes, but the 



6 THE MEASUREMENT OF ABILITIES 

following additional hints will apply to most of the tables which 
teachers or psychologists will need. 

(1) The number of different classes is generally somewhere between
6 and 15. Some tables contain more, few less. When the total number
of persons, N, is only 50 or less, a small number of classes is generally
sufficient; but if N is nearer 500, a more detailed table with a larger 
number of classes may be desirable. 

(2) Pick out from the original list of measures the highest and the 
lowest measure to be tabulated, and deduce from them the total 
range of the distribution. If the range is from 20 to 1, then 7 classes 
containing 3 different measures each are likely to be appropriate. If 
the range is from 137 to 65, then 15 classes of 5 are a possibility. 
In most instances all the classes should contain the same number of 
different measures, i.e. they should be of the same size.¹

(3) It is desirable, though not essential, to include an odd number
of different measures in each class. The reason for this is that in any
subsequent calculations we are going to regard all the measures
grouped within any one class as lying at the midpoint of that class.
If the classes include, say, 3-7, 8-12, 13-17, 18-22, etc., then we
shall count all the 3’s, 4’s, 5’s, 6’s, and 7’s as 5’s, and all from 8 to 12
as 10’s. The midpoint of a class is found by taking the average of the
highest and lowest measures it can contain; e.g. (3 + 7) ÷ 2 = 5.
The same figure is the midpoint of a class whose true limits are 2½
to 7½. But a class such as 39.95 to 40.95 has the rather inconvenient
midpoint of 40.45.

(4) With school marks it is advisable to arrange for any round 
numbers, such as the 5’s and 10’s, to fall at the centres of the classes.
Teachers are very apt to award these round numbers and to neglect 
intermediate ones. Hence, if classes of 10-14, 15-19, 20-24 ... are 
chosen, most of the frequencies may actually lie at 10, 15, 20 . . ., 
and it would be illegitimate to regard them as lying at the midpoints 
of 12, 17, 22 . . . The difficulty will be surmounted if classes of 
8-12, 13-17, 18-22 ... are used instead. 

(5) Having chosen the classes and written them out in a column, 
tabulate the F’s by the ‘tally’ method, as before, and check the 
total, N. 

¹ If the distribution is very far from symmetrical, as in the case of incomes,
school attendances, and the like, a table with unequal-sized classes may better
portray the state of affairs. But no averaging or other statistical calculations can
be carried out with such a table. It may be analysed by the chi-squared technique
(cf. p. 88).
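Hints (2) and (3) involve a little arithmetic, which may be sketched as follows. The text lays down no set rule for the size of a class; rounding up the quotient of the range by the intended number of classes is an assumption made here, chosen only because it reproduces the two illustrations given in hint (2). The names `class_size` and `midpoint` are likewise illustrative.

```python
import math
from fractions import Fraction

def class_size(lowest, highest, n_classes):
    """Smallest whole class size that covers every measure from lowest to
    highest in n_classes equal classes (an assumed rule, not the author's)."""
    n_different_measures = highest - lowest + 1
    return math.ceil(n_different_measures / n_classes)

def midpoint(lower, upper):
    """Average of the lowest and highest measures a class can contain."""
    return (Fraction(str(lower)) + Fraction(str(upper))) / 2

print(class_size(1, 20, 7))      # range 20 to 1, 7 classes
print(class_size(65, 137, 15))   # range 137 to 65, 15 classes

print(midpoint(3, 7))                     # class 3-7
print(midpoint("2.5", "7.5"))             # the same class by its true limits
print(float(midpoint("39.95", "40.95")))  # the inconvenient case
```

The choice of the starting point of the lowest class is a separate decision, governed by hint (4) where school marks are concerned.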
Ex. 2. — Below is the distribution of the measures already listed
in Ex. 1. They have been grouped into classes of 3 marks each. 


Classes   Tallies                   F
 18-20    ///// //                  7
 15-17    ///// ///// ///// //     17
 12-14    ///// ////                9
  9-11    ///// //                  7
   6-8    /////                     5
   3-5    ///                       3
   0-2    //                        2

                                 N=50
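The grouping of Ex. 2 may be reproduced by the following sketch, which assigns each mark to a class of 3 beginning at 0. The names `MARKS` and `grouped_distribution` are illustrative assumptions, and the function supposes that no measure lies below the chosen starting point.

```python
from collections import Counter

# Marks (out of 20) of the 50 pupils in Ex. 1, in the order listed.
MARKS = [10, 7, 5, 17, 15, 13, 4, 6, 9, 15, 16,
         20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6,
         20, 15, 18, 17, 13, 11, 4, 16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2]

def grouped_distribution(measures, start, size):
    """Return (lowest, highest, F) rows for equal classes of `size` different
    measures, the lowest class beginning at `start`; highest class first."""
    counts = Counter((m - start) // size for m in measures)
    rows = []
    for k in range(max(counts), -1, -1):
        lowest = start + k * size
        rows.append((lowest, lowest + size - 1, counts[k]))
    return rows

rows = grouped_distribution(MARKS, start=0, size=3)
for lowest, highest, f in rows:
    print(f"{lowest:>2}-{highest:<2}  {f:>2}")
print("N =", sum(f for _, _, f in rows))
```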

GRAPHICAL REPRESENTATION OF DISTRIBUTIONS 

A distribution may often be portrayed more clearly by means of
a graph, either of the type known as a histogram, or by a frequency
polygon.¹ In both of these the frequencies are plotted on the vertical
(Y) axis against the measures, or classes of measures, on the horizontal
(X) axis. The scale of the graph and the location of the origin (the
point of intersection of the axes) are purely matters of convenience.
Suppose, for example, that our graph paper measures 6 inches x 9
inches, each inch being divided into tenths, and that we wish to plot
the distribution given in Ex. 2. We might use a scale of 3 persons to
the inch on the Y axis, and one class (3 marks) to the inch on the X
axis. This has been done in Figs. 1-3. Fig. 4 portrays two distributions
of intelligence test scores which have been grouped into classes
73-77, 78-82 . . . 133-137. Each inch on the Y axis represents 5
persons, and each inch on the X axis includes 2 classes (10 points)
of score.

Fig. 1. — Histogram of distribution from Ex. 2.

Fig. 2. — Incomplete frequency polygon of distribution from Ex. 2.

¹ The ogive type of graph, or cumulative frequency curve, will be omitted
here. The reader may find accounts of its construction and uses in more advanced
textbooks, such as Garrett (1933) or Dawson (1933).

A completed graph generally looks best if its largest vertical and 
horizontal dimensions are roughly the same. Its appearance will be 
too peaked if the scale of frequencies is too large, or the scale of 
measures too small, and too flat if the converse holds. 

The origin is often placed at X = 0, but there is no need for this. 
Thus in Figs. 2-3, it is at X = -2, in order that the Y axis may
be drawn at the left-hand edge of the graph paper. And in Fig. 4 it
is at X = 100, in order that the Y axis may be in the centre of the 
page. 

The Histogram or Column Diagram. — Fig. 1 is a histogram portraying
the distribution given in Ex. 2. It consists of a series of pillars
or columns drawn side by side. Each pillar represents one measure
or, as in this instance, one class of measures, and its height represents
the frequency of this measure or class. The marks from 0 to 20 are
written along the X axis, but note that each mark is represented by a
space of ⅓ of an inch on this axis, not by one of the inch or tenth-inch
lines. The first class (including 0, 1, and 2 marks) is, therefore,
represented by the whole width of the first inch, the second class (3,
4, and 5 marks) by the whole width of the second inch, and so on.

Fig. 3. — Completed frequency polygon of distribution from Ex. 2.

The Frequency Polygon. — All the measures included in a class are
here represented by a point on the graph, whose Y co-ordinate is the
frequency, and whose X co-ordinate is the midpoint of the class.
Thus, in Fig. 2, which represents the same distribution as before, the
midpoints are written along the X axis, and this time they are located
either at an inch or tenth-inch line. The points on the graph are
connected up by straight lines. Note that it is incorrect to smooth
the lines into a curve, like those in Figs. 10-14. The figure is, as it
were, suspended in mid-air, hence it is usual to complete it (Fig. 3)
by connecting the first and last points with additional points whose
Y co-ordinates are zero. The X co-ordinates of these additional
points are the midpoints of the classes just below the lowest, and just
above the highest classes, in the distribution, i.e. X = -2 and
X = 22.
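The co-ordinates of a completed frequency polygon follow directly from this description, and the sketch below works them out for the distribution of Ex. 2. The names `CLASSES` and `polygon_points` are illustrative assumptions; the classes are supposed to be of equal size and listed from lowest to highest.

```python
# (lowest, highest, F) for each class of Ex. 2, lowest class first.
CLASSES = [(0, 2, 2), (3, 5, 3), (6, 8, 5), (9, 11, 7),
           (12, 14, 9), (15, 17, 17), (18, 20, 7)]

def polygon_points(classes):
    """Return the (X, Y) points of a completed frequency polygon: one point
    per class at its midpoint, plus a zero point one class-width beyond
    each end of the distribution."""
    size = classes[0][1] - classes[0][0] + 1
    points = [((lowest + highest) / 2, f) for lowest, highest, f in classes]
    below = (points[0][0] - size, 0)    # midpoint of the class just below
    above = (points[-1][0] + size, 0)   # midpoint of the class just above
    return [below] + points + [above]

for x, y in polygon_points(CLASSES):
    print(x, y)
```

The first and last points printed are the two additional points at X = -2 and X = 22 mentioned in the text.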

Ex. 3. — For further illustration there follow the frequency distri- 
butions of scores obtained by 136 men and 114 women students on 
the Cattell Intelligence Test, Scale IIIA. The scores are grouped into 
classes of 5. The distributions for the two sexes are plotted as 
frequency polygons on the same graph (Fig. 4), and for the sexes 
combined as a histogram (Fig. 5). Note in the latter that the scale 
along the Y axis has been halved, since combining the sexes roughly 
doubles the frequencies to be plotted. 


                      Frequencies
Test Scores     Men     Women     Combined
   133+           1        0          1
   128+           0        1          1
   123+           5        1          6
   118+           8        3         11
   113+          13        9         22
   108+          17        9         26
   103+          21       14         35
    98+          17       30         47
    93+          23       20         43
    88+          14       13         27
    83+           6       12         18
    78+           8        2         10
    73+           3        0          3

   N =          136      114        250


Relative Merits of Frequency Polygons and Histograms. — Fre- 
quency polygons possess the advantage over histograms that two or 
more distributions can be superimposed and compared on one graph, 
as in Fig. 4. But they are inferior in that the lines connecting suc- 
cessive points give a somewhat inaccurate notion of the distribution, 
especially when the number of points is small. For instance, in Fig. 4 
the connecting line between the two points X = 110, F = 17 and
X = 115, F = 13, suggests that 15 men students obtained intermediate
scores of 112½, which is quite untrue. In this respect the
histogram is unexceptionable, since it gives an exact portrayal of the
tabulated data. The total area enclosed within a histogram corresponds
accurately to the total number of persons, N. The area within
a polygon corresponds only approximately. Histograms have a further
advantage, namely, that they can be employed when the variable is
not a truly quantitative one. For example, examination scripts are
often graded, not with numerical, but with letter, marks (A, A-,
B+, etc.). The frequency distribution of the numbers of examinees
who are awarded each letter can be shown by a histogram. 
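Two of the points made in this section can be checked with the figures of Ex. 3. The sketch below is illustrative only and its names are assumptions. It forms the combined column, reads off the frequency that the connecting line of the polygon implies at a score of 112½, and confirms that the area of the histogram, in units of persons multiplied by points of score, is N times the size of a class.

```python
# Frequencies from Ex. 3, highest class (133+) first; classes of 5 points.
MEN   = [1, 0, 5, 8, 13, 17, 21, 17, 23, 14, 6, 8, 3]
WOMEN = [0, 1, 1, 3, 9, 9, 14, 30, 20, 13, 12, 2, 0]
CLASS_SIZE = 5

combined = [m + w for m, w in zip(MEN, WOMEN)]
n_total = sum(combined)

def interpolate(x, x0, y0, x1, y1):
    """Height of the straight line joining (x0, y0) and (x1, y1) at x."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# The men's polygon joins (110, 17) to (115, 13); at 112.5 the line
# stands at a frequency which no class in the table actually has.
implied = interpolate(112.5, 110, 17, 115, 13)

# Each pillar of the histogram is CLASS_SIZE wide and F high.
histogram_area = sum(f * CLASS_SIZE for f in combined)

print(combined)
print(n_total, implied, histogram_area)
```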

Fig. 4. — Frequency polygons of distributions from Ex. 3 (men and women).


CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS 

Smoothness of Distributions . — Here we must distinguish provisionally
between ‘objective’ and ‘subjective’ measures, and discuss
the difference more thoroughly in the next chapter. Objective 
measures are those which are practically independent of the person 
who makes them, such as heights measured with a ruler, or the 
scores on a standardized educational or intelligence test. Subjective 
measures, by contrast, depend upon personal evaluations. They
include, for example, the marks awarded by a teacher to a series of 
English compositions, assessments of pupils’ characters, and the like. 

Now when graphs are drawn of frequency distributions of objec- 
tive measures, certain characteristic features are very commonly 
found. First, the graphs are generally very irregular when the fre- 
quencies are small, but the irregularities tend to become ‘ironed out’ 
as the numbers increase. And when N reaches a value of several 
hundreds, the frequency polygon or histogram approximates to a 
smooth curve. Thus the distribution in Fig. 5 is much more regular 
in its rise and fall than is the distribution in Fig. 6, which represents 
the scores of a group of 50 students, selected at random from the
previous 250. 

Fig. 7, which is a frequency polygon of the heights of 6,194 male 
adults, gives an almost smooth curve.



Fig. 7. — Frequency polygon of heights of 6,194 English male adults.

Fig. 8. — Irregular distribution of test scores grouped in small classes.

Fig. 9. — More regular distribution of test scores grouped in large classes.
Frequencies may be increased not only by increasing the value of
N, but also by grouping the measures into larger classes. Figs. 8-9
show frequency polygons for the same set of 250 measures grouped 
into classes of 3 and 10 respectively. The latter is distinctly smoother 
than the former. 

Types of Distribution Curves . — In actual practice we can never
obtain measures of a large enough number of persons for the distribution
graphs to become completely smooth curves. Nevertheless,
it is customary to represent various types of distributions pictorially
by means of curves of appropriate shapes, to which obtained distributions
approximate more or less closely. For example, we can say
that Fig. 10 represents roughly the type of distribution which might
be expected for the incomes of parents of elementary school children.
The greatest number would probably lie between £300 and £800 per
annum, and the frequencies of those at higher income levels would
progressively diminish. This particular type of distribution is called
a skew curve, because of its asymmetry. It is positively skewed
towards the upper end, and shows a piling up of cases at the lower
end. A bunching at the upper end would give a negatively skewed
curve.

Fig. 11 is a bimodal curve such as might be obtained if we graphed 
the intelligence quotients¹ of pupils in two schools, one containing 
very intelligent children of professional class parents, the other con- 
taining inferior children of very poor stock. Most of the I.Q.s fall 
either between 110 and 140 or between 70 and 90, but there are a 
few of even higher or even lower intelligence, and a few (probably 
drawn from both schools) of intermediate or average intelligence, 
with I.Q.s of 90 to 110. 

The two polygons which appear in Fig. 4 are irregular on account 
of the small frequencies involved. They approximate, however, to a 
pair of curves such as those shown in Fig. 12. Note here that the 
more highly peaked curve (representing women’s scores) does not 
indicate that women obtain, on the whole, higher scores than men, 
but that women obtain more scores close to the average of 100 
points, and fewer extreme scores; in other words, that men are more 
dispersed or spread out in their ability. 

¹ It is assumed that all readers know roughly what is meant by intelligence 
quotients (I.Q.s). A definition may be found on p. 19, and a more precise account 
of their meaning on p. 194. 

Fig. 10. — Skew curve, illustrating approximate distribution of annual 
income among parents of elementary school children. 

Fig. 11. — Bimodal curve, illustrating intelligence quotients of mixed 
groups. 

It should be remembered that the areas enclosed beneath distribu- 
tion curves are proportional to the total numbers of cases involved. 
In Fig. 12 the areas are roughly equal, since the total numbers of the 
two sexes are about the same. Fig. 13 shows the curves to which the 
distributions of scores among Honours and among ordinary degree 
students approximate. The latter have, on the whole, somewhat lower 
scores than, but are four times as numerous as, the former. It is more 
usual, when groups of such different sizes are being compared, to 
plot percentage frequencies instead of actual frequencies, since the 
curves will then possess the same area, and differences in the 
averages and ranges or dispersions of measures will stand out better. 

Fig. 12. — Two distributions of intelligence test scores, with different 
dispersions. 

Fig. 13. — Distributions of intelligence test scores of two groups, one 
much larger than the other but of lower average intelligence. 
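
The conversion to percentage frequencies is a matter of simple arithmetic, and may be sketched as follows. The class frequencies below are invented for illustration (they are not the data behind Fig. 13); the only feature borrowed from the text is that the second group is four times as numerous as the first.

```python
def percentage_frequencies(frequencies):
    """Convert actual class frequencies into percentages of the group total."""
    total = sum(frequencies)
    return [100.0 * f / total for f in frequencies]

# Hypothetical frequencies for the same seven score classes (low to high).
honours = [1, 4, 10, 20, 30, 25, 10]         # 100 students
ordinary = [20, 60, 120, 100, 60, 30, 10]    # 400 students

honours_pct = percentage_frequencies(honours)
ordinary_pct = percentage_frequencies(ordinary)

for h, o in zip(honours_pct, ordinary_pct):
    print(f"{h:5.1f}%  {o:5.1f}%")
```

Each column of percentages now totals 100, so that curves drawn from them enclose equal areas, and the lower average of the larger group is no longer obscured by its greater numbers.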


THE NORMAL FREQUENCY DISTRIBUTION 

Whenever a large group of human beings is taken at random (i.e. 
not specially selected like the two schools whose I.Q.s are 
graphed in Fig. 11), then objective measures of almost any trait or 
ability they possess — physical or mental — will be found to conform to 
a certain type of distribution known as the normal curve. This is often 
referred to as the normal probability curve, the normal curve of 
error, or the Gaussian curve. It is symmetrical, bell- or cocked-hat- 
shaped. It indicates that the majority of persons in the group obtain 
measures round about the average, relatively few obtain extreme 
measures, and that the frequencies above the average are the same 
as the frequencies below it. The four curves in Figs. 12-13 belong to 
this type, and all the graphs from Fig. 4 to Fig. 9 roughly approxi- 
mate to it, since they represent distributions of scores obtained by 
unselected groups of persons. Innumerable instances of distributions 
which tend to this normal shape could be cited: the heights of chil- 
dren in an ordinary school class, the numbers of times they can tap 
with one finger in a minute, or the numbers of words they can write. 
Variations in individual performances as well as differences between 
individuals show the same trend : if a person tries a hundred times 
to copy a line of a certain length, most of his attempts will be close 
to his own average, and a few will be considerably too long or too 
short. Differences between groups are also similar: if all the ele- 
mentary school classes aged 11 + in a large city are given an objec- 
tive arithmetic test, and the average mark of each is calculated, then 
a graph of all these averages may be expected to show a normal 
distribution. 

Conditions which Upset the Tendency to Normality. — There are 
several factors which may disturb this normal tendency. First, the 
total frequencies involved may be so small that the irregularities, 
noted above, may mask it. It is usually apparent, however, even with 
groups of 50 or less (cf. Fig. 6). Secondly, the group of persons may 
be selected in such a way as to distort an otherwise normally distri- 
buted variable. For example, a school class which has been ‘creamed’ 
will be likely to show a negatively skewed distribution of test scores 
or examination marks, since there will be a deficiency of pupils of 
high ability. The police force or the army will not show a normal 
distribution of heights, if no recruits are accepted below a certain 
height. 
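
The distorting effect of selection can be demonstrated on artificial data. The sketch below is an illustration under stated assumptions, not a reproduction of any real class or regiment: the measures are random numbers drawn from a normal distribution, the proportions removed (one-fifth) are arbitrary, and skewness is computed as the ordinary moment coefficient, whose sign agrees with the usage adopted above (negative when the cases bunch at the upper end).

```python
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / SD cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

rng = random.Random(1)
unselected = [rng.gauss(100, 15) for _ in range(20000)]
ordered = sorted(unselected)

# 'Creaming': the ablest fifth are drafted elsewhere.
upper_cut = ordered[int(0.8 * len(ordered))]
creamed = [s for s in unselected if s < upper_cut]

# Minimum standard: nobody below the cut-off is accepted
# (the same numbers stand in for heights, purely for illustration).
lower_cut = ordered[int(0.2 * len(ordered))]
recruits = [s for s in unselected if s >= lower_cut]

print("unselected:", round(skewness(unselected), 2))
print("creamed:   ", round(skewness(creamed), 2))
print("recruits:  ", round(skewness(recruits), 2))
```

The unselected group gives a coefficient close to zero; removing the top produces a negative value, and removing the bottom a positive one.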

Thirdly, there are a few human characteristics which, although 
objectively measurable, are known to be abnormally distributed. 
Accident proneness is one such variable. It is found that a few people 
are liable to undergo a large number of accidents in any given period 
of time, whether at home, or in factories, or when driving cars. But 
the majority have few or no accidents.* Thus the distribution takes 
roughly the form shown in Fig. 14. Annual income also gives a skew 
distribution, and it is likely that other measures of ‘productivity’ 
such as literary or scientific output are similar (cf. Burt, 1943). It is 
hardly possible to prove that intelligence quotients conform to the 
normal curve, though this is a convenient assumption that accords 
reasonably with commonsense observation. It does break down at 
the bottom end of the scale, since there are more imbeciles and idiots 
of very low I.Q. in the population than would be expected. Some 
investigators have claimed other marked divergences from the nor- 
mal shape, but these might be attributable to inequalities in the units 
of measurement. 

Such irregularities, or defects in our methods of measuring vari- 
ables, constitute a fourth condition, which is of very frequent 
occurrence, as we shall see in the next chapter. 



Fig. 14. — Approximate frequency distribu- 
tion of accident proneness. 


* Cf. H. M. Vernon (1936). 



CHAPTER II 


PRINCIPLES OF MENTAL MEASUREMENT AND MARKING 

We are all accustomed nowadays to regarding a scholastic capacity 
(e.g. arithmetical ability), like height or weight, as ranging from a 
very small amount in some people to a very large amount in others, 
and to assessing these amounts in numerical units such as percentage 
marks. General intelligence and other aptitudes can similarly be 
graded as quantitative variables, average intelligence being generally 
denoted by an Intelligence Quotient or I.Q. of 100, superior or 
inferior intelligence by larger or smaller numbers. Yet the conception 
of scaling mental qualities in a manner similar to physical ones is of 
fairly recent origin — we owe it largely to the work of Sir Francis 
Galton — and it is still somewhat limited in application and beset 
with many difficulties. A human ability or trait is far more complex 
than a length, a weight, or a time ; it can never be completely repre- 
sented by a single numeral on a scale. Consider, as an example, 
political opinions — a communist is certainly more socialistically in- 
clined than a liberal, and a fascist is more reactionary than a con- 
servative. Successful tests have, therefore, been devised which scale 
these attitudes as a single variable ranging from extreme left-wing to 
extreme right-wing. But such a scale can hardly do justice to all the 
various shades of political opinion ; it inevitably involves some over- 
simplification and artificiality. Particularly in the field of tempera- 
ment and character does such distortion arise, for here people appear 
to us to vary from one another, not merely in amount, but also in 
kind or quality.¹ The same is true, though to a lesser extent, in the 
field of abilities. Thus the precise nature of intelligence is an extremely 
controversial matter, and the common tendency to regard the I.Q. 
as an accurate measure of some definite and stable mental entity is, 
as we shall see below, an egregious misinterpretation. 

¹ For a fuller discussion of these points, cf. Vernon (1953). 

No mental trait or ability can be measured directly by laying some 
kind of ruler alongside it. Instead we have to grade it by observing 
its expression in words or actions. Arithmetical ability, for example, 
is shown by success at a series of sums; moral character by certain 
modes of behaviour. Indeed, for scientific purposes, we have to regard 
a trait or ability, not as some power or faculty in the mind, but as a 
particular class or category of related performances, e.g. of the 
arithmetical or the moral kind. And special statistical techniques are 
required for deciding just what performances should be included 
under one heading, as representative of one ability or trait (cf. Chap. 
VIII). Since we cannot, of course, observe and record all the per- 
formances which go to make up a mental characteristic, we are 
forced to take samples of them. Our procedure, as Binet pointed out, 
is analogous to that of the mining engineer who cannot examine all 
the ground in a given locality, but instead takes borings at various 
points, and from these samples deduces with reasonable accuracy 
what the district as a whole is like. Similarly, the psychologist 
measures intelligence, or the teacher measures achievement in arith- 
metic, on the basis of the child’s success at sample intellectual and 
arithmetical tasks. An arithmetic examination sets, perhaps, six 
questions, the answers to which constitute a sampling of the general 
field of arithmetical knowledge which the pupil is supposed to have 
acquired. 

How, then, is success at such sample performances expressed in 
numerical terms? The various methods of marking or scoring mental 
tests and examinations may be classified as follows : 

A. Enumeration or count scoring. 

B. Ranking. 

C. Qualitative grading and classifying. 

D. Numerical evaluation. 

E. Mixed numerical evaluation and count scoring. 

A. ENUMERATION OR COUNT SCORING 

This type of marking is the only objective type, in the sense that 
each person’s mark is quite independent of the marker’s subjective 
evaluation of his work. When, for example, a series of simple arith- 
metic sums is set, the answers to which are all either right or wrong, 
or when one mark is deducted for each mistake in a dictation test, 
then a pupil’s total mark is a count or enumeration score. Since this 
method is applicable only when no doubt can exist as to the correct- 
ness or incorrectness of the answers which are counted up, it is 
largely restricted to very simple kinds of work. We shall see 
later, however (cf. Chap. …), that attainment in advanced 
and complex school or university work can be fairly adequately 
measured by sets of brief, objectively marked questions. The great 
majority of standardized educational and general intelligence tests 
follow this method, so as to eliminate the personal element in scoring. 
Steel and Talman (1936) claim that even English compositions can 
be marked by counting all the good and bad points of vocabulary, 
sentence-structure, and sentence-linking which they contain. The 
scoring is objective also among what are known as performance tests 
of intelligence or of special aptitudes. The child is set some practical 
or manipulative task, and a record is obtained, either of the number 
of pieces successfully put together or of the number of seconds taken. 

This type of marking includes two distinct methods of measure- 
ment which may be termed the rate method and the power method. 
A test or examination may contain a number of fairly simple tasks, 
all of approximately uniform level of difficulty. The score is then the 
number of these tasks accomplished within a certain time limit. 
Alternatively, it may contain tasks which are graded in difficulty 
from quite simple to very hard tasks, such that all persons being 
tested can answer some of them, and scarcely any or none can 
manage all of them. Here no time limit is necessary. The score is 
again based on enumerating the tasks successfully completed, since 
those whose ability, knowledge, or ‘power’ is greater are able to 
answer tasks of a higher level of difficulty. These two methods have 
aptly been compared to a hurdle race and a high-jump competition 
respectively. The table of tests given in Chap. IX lists several 
examples of both. Thus, Ballard’s One-minute Reading and Arith- 
metic Tests and Burt’s Reading Fluency Test belong to the former 
type. Burt’s Graded Vocabulary Tests of Reading and Spelling and 
the Terman-Merrill Intelligence Test belong to the latter. Most tests 
and examinations, however, combine the two, i.e. they include easy 
and difficult items, but also impose a time limit, so that the score 
depends on power and rate of working. 

Comparison of Mental Count Scores with Physical Measures. — We 
must ask now whether such count scores can legitimately be regarded 
as measures of mental abilities, analogous, say, to measures of height 
and weight or of physical abilities such as rate of running, power of 
muscular contraction, and the like. Actually the numbers in terms 
of which abilities are expressed show several important differences 
from numbers such as inches on a ruler or pounds on a weighing 
machine. The latter, physical, units are all equal to one another at 
any point on the scale. E.g. the difference between 50 and 49 inches 
is precisely the same as the difference between 40 and 39 inches. 
Moreover, they always have a zero point: e.g. 0 lb. means no weight 
at all. Mental ability units often lack these features, and this fact 
precludes us from adding, averaging, or otherwise treating them as 
if they were true numbers. 

For example, a one-year-old child might fail to pass any of the 
Terman-Merrill Intelligence Tests, but his zero score shows, not that 
he possesses zero intelligence, but that the tests are an inadequate 
measuring instrument for this purpose. Again, a child with an 
I.Q. of 120 cannot be considered twice as intelligent as one with 
an I.Q. of 60, because we do not know what is the zero point of 
the I.Q. scale. Even in a rate test of, say, reading speed, a child who 
reads 50 simple words in a minute, and so scores twice as much as 
one who reads 25 words in the same time, cannot properly be said 
to be twice as good at reading. 

It is obvious that the units in a rate test, such as words read, can 
never be precisely uniform in difficulty. Similarly, the various errors 
in a dictation test may vary considerably in seriousness. It should be 
noted also that two children who obtain the same score on a rate 
test do not necessarily accomplish precisely the same items, since one 
may fail on, or omit, certain items which the other passes ; and they 
may each make the same number of mistakes in the spelling of a 
certain passage, many of which are different. Our tests, then, might 
be compared to a weighing machine with a large number of pound 
weights, some of which are slightly heavier, some slightly lighter, 
than a pound. When using the machine to weigh children we do not 
always pick out precisely the same set of weights. Now, in spite of 
these defects in the weights, we can claim that two children, both 
found to be 70 lb., are very nearly the same weight, provided that 
the irregularities among the pounds are small. And if four children 
are found to weigh 90, 80, 70, and 60 lb., we can claim that the 
difference between the first two is likely to be nearly equivalent to 
the difference between the last two. Thus this analogy indicates that 
when the number of items in a mental test is large, and irregularities 
are small, the mental units may be regarded as approximately 
equivalent. 

In tests of the pure power, or combined rate and power, types, the 
items are not intended to be equal in difficulty. Instead, each item 
should exceed the previous one in difficulty by approximately the 
same amount. For example, each test in the Terman-Merrill Scale 
(from Year VI to Average Adult level) is supposed to represent an 
increment of two months of mental age.¹ But it is generally admitted 
nowadays that mental age units are not equivalent, and that the 
difference between a child of M.A. 13½ and one of 13 is probably 
much smaller than the difference between children of 6½ and 6. Here, 
then, it is as if we ran out of (nominal) pound weights after a certain 
point, and had to use further weights which progressively increased 
in heaviness. Obviously this leads to grave difficulties in the mental 
age method of scoring (cf. Chap. IV). Several group intelligence 
tests, however, do manage to maintain approximately equal incre- 
ments throughout. 

The units of measurement of many physical skills present similar 
problems. That equal increments of score do not necessarily represent 
equal increments of skill can be seen from the example of golf scores. 
It is decidedly easier for a player who requires 140 strokes for a round 
to improve to 130, than it is for a player who takes 70 to improve 
to 60 (or even — if we could assume golf scores to follow a geomet- 
rical rather than an arithmetical progression — to 65). Here, also, it is 
impossible to establish a zero point of ability, or to calculate ratios 
between abilities. 

Special techniques have been devised by Thorndike, Thurstone, 
and others for the production of measuring scales with equal units, 
and for determining the zero points on such scales. They are too 
complex for description here. But they depend in the main upon the 
principle of normal distribution of objective measures, which was 
enunciated in Chap. I. Well-constructed tests always tend to yield 
a normal frequency distribution; hence we are entitled to regard 
departures from normality as indicative of bad construction or 
bad scoring. For example, a mental test which contains plenty 
of easy and moderately difficult items, but whose later items in- 
crease too rapidly in difficulty to allow sufficient headroom, will 
give a negatively skewed distribution. The analogy with a weighing 
machine would be as follows: suppose that the available weights 
beyond the seventieth became progressively bigger than 1 lb., and 
that a large group of children were measured whose true weights 
averaged about 70 lb., and ranged from 50 to 90 lb. The distribution 
would then be distorted from normal to negative skew type, as in 
Fig. 15. Conversely, a test which contains insufficient easy items, so 
that it only differentiates effectively among the better pupils, is 
likely to show a positively skewed distribution. 

¹ For a definition and discussion of mental age, cf. pp. 69-73. 

Application to School Marks and Examinations. — It is difficult to 
apply this principle to school marks, since the groups are usually so 
small (less than 50) that great irregularities in distribution are 
sure to arise. The following table, however, gives the marks obtained 
by two 11+ classes in an ordinary school on an objective arithmetic 
test. Owing to the deficiency in difficult questions, many pupils at 
the top got full marks (80), and the real extent of their ability was 
not tested at all. 

Fig. 15. — Normal distribution (a), altered to negative skew curve (b), 
by increasing the size of the higher units. 

Ex. 4. — An abnormal frequency distribution of arithmetic marks. 

    Mark       F 
    80        15 
    70+        6 
    60+        6 
    50+       14 
    40+       17 
    30+        8 
    20+        4 
    10+        5 
    0+         3 

            N = 78 
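
The piling up at full marks in Ex. 4 can be checked directly from the tabled frequencies. The short sketch below merely transcribes the table and prints a rough histogram; the variable names are our own.

```python
# Frequencies transcribed from Ex. 4: lower limit of each class -> F.
# The top class is the single mark 80 (full marks).
ex4 = {80: 15, 70: 6, 60: 6, 50: 14, 40: 17, 30: 8, 20: 4, 10: 5, 0: 3}

n = sum(ex4.values())
share_at_ceiling = ex4[80] / n

# A text histogram makes the second peak at full marks visible.
for lower in sorted(ex4, reverse=True):
    print(f"{lower:>3} {'#' * ex4[lower]}")

print(f"N = {n}; proportion with full marks = {share_at_ceiling:.2f}")
```

Nearly one pupil in five obtains the maximum, so that the test tells us nothing about the differences in ability among these fifteen.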

Further problems occur in the interpretation of count scores. 
Physical measures, such as 60 inches or 15 seconds, always represent 
definite amounts. No doubt exists as to their largeness or smallness, 
because an inch or a second has an absolute, standard size. Laymen 
and teachers often wrongly suppose that the same is true of mental 
measures, such as the mark awarded to a pupil’s work. They talk of 
15/20 or 60% as if these numbers implied the same level of achieve- 
ment whenever, or by whomsoever, they are awarded. Yet it is surely 
obvious that such scores, by themselves, tell us nothing at all about the 
goodness or poorness of achievement. Such statements as: “Johnny 
is doing well at arithmetic. He got 8 sums out of 10 correct” 
or “His dictation is terrible; he made 15 mistakes” — are quite 
meaningless unless we also know the difficulty of the sums and of the 
passage dictated. Johnny may be the poorest arithmetician in his 
class, and yet score 8/10 if sufficiently simple sums are set. Or he 
may be the best speller in his class, and yet make 15 mistakes if the 
passage happens to be taken, say, from a textbook of organic 
chemistry. The speaker is, of course, implying that the tasks are of a 
level of difficulty such that most of the class score less than 8 sums 
correct and make fewer than 15 mistakes. In other words, the marks 
possess only a relative, not an absolute, significance. Again, when an 
inspector or head teacher complains that the average examination 
mark of a class is ‘too low,’ he is 
presumably comparing the mark and the difficulty of the examina- 
tion with subjective recollections of the achievements of other 
similar classes. Undoubtedly there is a great deal of loose thinking 
about these matters, and much can be learnt from the more sys- 
tematic procedure of the scientific mental tester. 

The tester realizes that the goodness or poorness of a child’s test 
performance can be interpreted only by comparing it with the per- 
formances of other similar children. Hence he applies his test to 
large numbers of children, similar as to age and other relevant 
circumstances, scores their performances all in the same way, and 
can then state definitely whether a certain performance is superior or 
inferior, and how much better or poorer it is than the average. This 
procedure is called the establishment of norms or standards for the 
test. A discussion of the main types of test norms is given in Chap. 
IV, where it is also shown how the adoption of analogous methods 
may greatly help in the interpretation of school and examination 
marks. 
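
As a rough illustration of what reference to a norm group involves, the following sketch expresses Johnny’s 8 sums out of 10 in two ways: as a deviation from the group average in units of the standard deviation, and as the percentage of the group scoring lower. The two sets of scores are invented for the purpose (an easy paper and a hard one), and the function `standing` is our own; the forms of norm actually in use are those described in Chap. IV.

```python
import statistics

def standing(score, norm_scores):
    """Express a score relative to a norm group.

    Returns (deviation from the group average in SD units,
             percentage of the group scoring lower).
    """
    mean = statistics.fmean(norm_scores)
    sd = statistics.pstdev(norm_scores)
    below = sum(1 for s in norm_scores if s < score)
    return (score - mean) / sd, 100.0 * below / len(norm_scores)

# Hypothetical norm groups: sums correct out of 10.
easy_paper = [10, 10, 9, 9, 9, 9, 8, 8, 9, 10, 10, 9]
hard_paper = [3, 4, 5, 2, 6, 4, 5, 8, 3, 4, 5, 6]

z_easy, pct_easy = standing(8, easy_paper)
z_hard, pct_hard = standing(8, hard_paper)

print(f"easy paper: {z_easy:+.2f} SD, {pct_easy:.0f}% of group below")
print(f"hard paper: {z_hard:+.2f} SD, {pct_hard:.0f}% of group below")
```

The same raw mark of 8 lies below the average on the easy paper and far above it on the hard one.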


B. RANKING 

Ranking means listing a group of pupils or students in order of 
their attainment, 1st, 2nd, etc., down to bottom. This method may 
be applied to any form of scholastic work, including composition, 
handwriting, and the like, which are scarcely amenable to count 
scoring. It is actually the soundest of all forms of marking, in spite of 
its many limitations, because these limitations are so obvious and so 
easily recognized. The first drawback is that the number of persons 
who can be ranked is rather small. Even the arrangement of twenty 
in order involves a good deal of uncertainty, especially near the 
middle of the list. Those at the extremes are usually more readily 
discriminated from one another. Next, it is clear that these ranks do 
not provide a true numerical scale, since the numbers 1, 2, 3, etc., are 
by no means equally spaced. Usually the differences between the 
attainments of Nos. 1 and 2, 2 and 3, 18 and 19, 19 and 20 (out of 20) 
are greater than the differences between Nos. 9 and 10, 10 and 11 — 
hence the difficulty, just mentioned, in ranking the medium pupils. 

Fig. 16. — Rectangular frequency distribution given by a rank order. 

The frequency distribution is rectangular (Fig. 16), and is entirely 
unlike the normal curve. These numbers, therefore, may give quite 
false impressions if they are added, averaged, or otherwise manipu- 
lated. For example, a pupil who is first in one examination and third 
in another should be regarded as better than, not as equal to, a 
pupil who is second in both (cf. p. 59).¹ 
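
The point may be made concrete by turning ranks into normally distributed scores. The sketch below uses one common convention, which is an assumption on our part and not necessarily the procedure of the books cited: each pupil is placed at the middle of his own slice of the group, and the corresponding normal deviate is read off. A class of 20 is assumed for illustration.

```python
from statistics import NormalDist

def rank_to_normal_score(rank, n):
    """Normal deviate for a rank (1 = best) among n pupils.

    Assumed convention: the pupil stands at the midpoint of his slice of
    the group, i.e. at the proportion (n - rank + 0.5) / n from the bottom.
    """
    proportion_below = (n - rank + 0.5) / n
    return NormalDist().inv_cdf(proportion_below)

n = 20
first_and_third = rank_to_normal_score(1, n) + rank_to_normal_score(3, n)
second_twice = 2 * rank_to_normal_score(2, n)

print(f"1st + 3rd: {first_and_third:.2f}")
print(f"2nd + 2nd: {second_twice:.2f}")
```

The raw ranks total 4 in both cases, but on the normal scale the pupil placed first and third comes out ahead, because the step from 2nd to 1st is larger than the step from 3rd to 2nd.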

A third main drawback is that these numbers are always relative 
to the particular set of persons who have been ranked. Thirtieth 
place is not an unequivocal number like 30° on a thermometer. For 
30th out of 32 is much poorer than 30th out of 50. Moreover, if any 
one person higher on the list should happen to be absent, the 30th 
rank would change to 29th. It is not even analogous to, say, 30 mis- 
takes in a dictation test, since such a test can be applied to large 
numbers of children, and norms can be collected which will tell us 
how low is the level of ability to which 30 mistakes corresponds. In 
contrast, a rank position can never be so interpreted with reference to 
outside standards. 

¹ The numbers can, however, be converted into a normally distributed set of 
scores (cf. Garrett’s or Guilford’s books). And it is possible, without any con- 
version, to correlate one rank order with another, e.g. to compare a teacher’s 
ranking of school work with the pupils’ results on a group intelligence test 
(cf. Chap. VI). 

C. QUALITATIVE GRADING AND CLASSIFYING 

Examination scripts, essays, etc., are often merely classified into 
some three to ten grades which are distinguished by letters (A, A−, 
B+, etc.; α, β, etc.) or by names (Credit, Pass, Fail; 1st class, 2nd 
class, etc.; Excellent, Very Good, Good, etc.). This is a more 
primitive type of grading than numerical evaluation (e.g. on a 0 to 10, 
or a percentage, scale), though it is of course often applied to scripts 
which have already been awarded percentage marks. For instance, 
examinees whose mark is 80% or more are informed that they have 
obtained First Class or Credit, and so on. 

Since these grades make no pretence of being numerical measures, 
we need not cavil at the eccentricity of their frequency distributions. 
They are, however, liable to another, still more serious, defect which 
is not found among count scores or ranks, namely that different 
teachers or examiners often employ widely different scales or stan- 
dards of grading. Although they are supposed to represent absolute 
evaluations of achievement, they are actually found by experimental 
investigation to involve such gross discrepancies that neither pupils, 
parents, nor teachers can hope to interpret their significance correctly. 
For example, Hartog and Rhodes (1935) arranged for 48 School Cer- 
tificate French scripts to be graded independently by six experienced 
examiners. One of the examiners awarded 46 credits, 2 passes, and 0 
failures. Another gave 17 credits, 12 passes, and 19 failures. The 
remaining four were intermediate. Similarly, Boyd (1924) had 26 
essays by 11+ children graded by 271 teachers as either Excellent, 
VG+, VG, G+, G, Moderate, or Unsatisfactory. Some of the 
teachers awarded one or other of the two highest grades to as many 
as 15 of the essays, others to none of them. Some also awarded one or 
other of the two lowest grades to 17 of the essays, others to none of 
them. In view of such results, it is clear that there is no agreed 
meaning as to the absolute amount of ability which these gradings 
should signify. A mark of Excellent may represent a high level of 
ability if it is awarded by a teacher who very seldom gives Excellents. 
It may mean little better than average ability if it is awarded by a 
teacher who habitually grades nearly half his papers as Excellent. 
In other words, this type of marking is actually a relative one, like 
ranking. A mark of G+, or β−, or Pass, etc., only achieves any 
meaning if we know what proportions of candidates obtained better 
or poorer marks, in just the same way as a rank position of 30th can 
only be interpreted if we know how many candidates were ranked 
lower than 30th. 
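
The figures quoted above from Hartog and Rhodes show how differently the same grade may rank under two examiners. The sketch below simply works out, for the most lenient and the most severe of the six, what percentage of the 48 scripts received a given grade or better; the function name and layout are our own.

```python
# Grades awarded to the same 48 scripts by the two most divergent
# examiners (figures as quoted above from Hartog and Rhodes, 1935).
lenient = {"Credit": 46, "Pass": 2, "Fail": 0}
severe = {"Credit": 17, "Pass": 12, "Fail": 19}

def share_at_or_above(grades, grade, order=("Credit", "Pass", "Fail")):
    """Percentage of scripts receiving `grade` or a better one."""
    total = sum(grades.values())
    better_or_equal = order[: order.index(grade) + 1]
    return 100.0 * sum(grades[g] for g in better_or_equal) / total

lenient_credit = share_at_or_above(lenient, "Credit")
severe_credit = share_at_or_above(severe, "Credit")

print(f"Credit from the lenient examiner: top {lenient_credit:.1f}% of scripts")
print(f"Credit from the severe examiner:  top {severe_credit:.1f}% of scripts")
```

A Credit from the first examiner places a script merely somewhere among nearly the whole group; from the second it places the script in roughly the top third.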

Qualitative grading should then be regarded as a form of ranking 
where, instead of there being as many different ranks as there are pupils 
or examinees, there are only some 3 to 10 ranks, any one of which 
may be awarded to several pupils whose work is of approximately 
equal merit. Its grave defect is that any one script is not usually 
compared directly with the scripts of a particular group of pupils be- 
fore receiving its grade ; it is merely compared with the marker’s more 
or less vague recollections of the scripts which he has marked in the 
past. The only possible rational meaning for Excellent or First Class 
is that the script which receives this grade seems to him to rank 
higher, or to be relatively better, than almost all the scripts produced 
by similar pupils which he has come across. The other grades are 
similarly referred to rough subjective standards which he has set up 
during his previous marking experience. And since his experience 
naturally differs from all other markers’ experiences, so will his 
standards differ, unless he has taken the trouble to discuss and com- 
pare them with others, and to try to adjust them accordingly. 

Finally, as in the case of rank orders, there is no possibility of 
exact comparison of such grades with external standards or norms, 
and they cannot be added or averaged in any satisfactory way.¹ 

D. NUMERICAL EVALUATION 

and 

E. MIXED NUMERICAL EVALUATION AND COUNT SCORING 

Numerical evaluation consists, nominally, in assigning absolute 
numerical grades to pupils’ abilities, school work, or examination 
scripts, either on a 0-10, 0-20, percentage, or some other scale. It 
exhibits all the ambiguities of the qualitative grading just described, 
and possesses several additional defects of its own. In other words, 
although it is the most commonly used, it is quite the worst type of 
marking. Unfortunately, since it involves numbers, it is frequently 
confused with the count scores obtained from objective marking 
(Type A). Teachers assume that the label of, say, 15/20 which they 
award to an English composition is equivalent to a score of 15 cor- 
rect arithmetic sums out of 20. Often the two become inextricably 
mixed. For instance, partial credits involving subjective judgments 
of merit may be awarded to partially right sums. Or an examination 
in history, science, languages, etc., may include, say, five questions 
each of which is marked subjectively out of 20, the marks then being 
added as if they were count scores to yield percentage totals. Or some 
analytic plan of marking may be adopted whereby one mark is 
counted for each correct fact or idea, and so many are added for 
subjective impressions of style or general merit. 

¹ When the group of pupils is a large one, their grades can be converted into 
rational scores; cf. Dawson (1933), pp. 91-3. 

Distributions of such marks are sometimes smooth and symmetrical, 
but are often wildly irregular or skewed. In an examination where a 
pass mark of, say, 50% has been fixed, it is common to find very 
large numbers of 50’s, 51’s, and 52’s, but no 45’s to 49’s. Some 
teachers show peculiar idiosyncrasies, such as avoiding certain marks 
altogether. For instance, they may award 9, 8, 6, 5, or 3 out of 10 
very often, but scarcely ever give 10, 7, 4, 2, or 1. Two illustrations 
may be given of distributions among students or pupils who also 
took an examination which was objectively scored. 

Ex. 5. — Distributions of (Subjective) Marks for Reading and
(Objective) Marks for Geography.

Mark     Reading F     Geography F
 10         24              2
  9         25              3
  8         13              8
  7          6             10
  6          4             11
  5          2              9
  4          0              9
  3          0              7
  2          0             10
  1          0              4
  0          0              1
Total       74             74


The reading marks in Ex. 5 are fairly typical of the distribution 
adopted by teachers in evaluating this subject, or composition, or 
handwriting. Note that this teacher was not, as she supposed, mark- 
ing out of 10, but out of 6, since she only used 6 different grades. 
The fact that about one-third of the class get 10 implies that the 
work of all these children was equally good, which is certainly not 
true. The geography marks awarded by the same teacher were 
derived from a new-type test paper, and so give a roughly normal 
distribution. It is surely obvious that 5/10 for geography does not 
have the same significance, nor represent the same achievement, as 
5/10 for reading. In the one it means approximately the average for 
the class, in the other it means right at the bottom of the class. But 
the children themselves and their parents are likely to assume that 
5/10 always means the same thing. 
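
The contrast can be checked directly from the frequencies of Ex. 5. The following is only an illustrative sketch; the function name `percent_below` is our own and does not come from the text. It counts what percentage of the 74 pupils fall below a mark of 5 in each subject.

```python
# Frequencies from Ex. 5: number of pupils (out of 74) at each mark, 10 down to 0.
marks = list(range(10, -1, -1))
reading_freq = [24, 25, 13, 6, 4, 2, 0, 0, 0, 0, 0]
geography_freq = [2, 3, 8, 10, 11, 9, 9, 7, 10, 4, 1]


def percent_below(mark, marks, freqs):
    """Percentage of pupils whose mark is strictly lower than `mark`."""
    total = sum(freqs)
    below = sum(f for m, f in zip(marks, freqs) if m < mark)
    return 100.0 * below / total


reading_below = percent_below(5, marks, reading_freq)
geography_below = percent_below(5, marks, geography_freq)
print(f"Reading:   {reading_below:.0f}% of the class scored below 5")
print(f"Geography: {geography_below:.0f}% of the class scored below 5")
```

No pupil scores below 5 in reading, whereas 31 of the 74 pupils do so in geography, so the same label of 5/10 marks the bottom of one distribution and roughly the middle of the other.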

Fig. 17 compares the distributions of percentage marks obtained
by a large group of students on a new-type examination and on
an ordinary examination of the essay-type. The former, though
somewhat irregular, tends towards the normal, symmetrical shape,
and includes marks ranging from 19% to 86%. The latter is skewed,
and it ignores the bottom half of the available scale of marks, since
it ranges only from 50% to 90%.

Fig. 17. — Distributions of percentage marks from (a) an objective examination,
and (b) a subjectively marked examination.

Education authorities would surely distrust a school doctor the 
weights on whose weighing machine varied, say, from } lb. to li lb. 
each. Yet the units implied by the distributions which many of their 
teachers employ are probably quite as variable as this, unless some 
scheme of rationalization is applied (cf. Chap. IV). It is notorious
also that some faculties in a university use quite different scales of 
marks, or are much more lenient than other faculties. 

It must be obvious that such marks are not true measurements. 
60% does not represent 60/100 of anything whatsoever. True, it lies a
certain distance between 0 and 100, but as nobody can define what 
is meant either by zero or perfect ability, that fact does not help 
much. Like the qualitative grades of Type C, they are merely labels 
whose precise size is fixed more or less vaguely by convention. 
Naturally the convention varies in different institutions. 

Numerical grading, like qualitative grading, actually consists of 
ranking the scripts in relation to the marker’s subjective past
experience. It differs merely in that number labels are substituted for
verbal ones, and that most numerical scales contain a larger number
of different labels than do qualitative scales. This latter feature leads
to still another common defect. When an essay or a single examination
answer has to be marked on a percentage basis, the implication
is that the marker can distinguish, and keep in mind, a hundred 
different grades of ability (or 51 different grades if he confines himself 
to a range of 40% to 90%). Actually it has been proved by experi- 
ment that people cannot consistently discriminate more than about 
10 grades, and some would say only 5. 

THE REFORMATION OF MARKING 

The would-be reformer of marks meets with many objections and 
criticisms. Most of them either rest upon ignorance of elementary 
statistical principles, or else ‘boil down’ to the view that the systems 
in vogue at present are firmly established, generally understood by 
all concerned, and work too well and smoothly to be upset. While
such a reactionary attitude is entirely unwarranted, yet there are
admittedly many occasions in school marking where the rationality
of the scales employed matters little, or where an eccentric distribu- 
tion may be an actual advantage. 

Some important examinations do not aim at measuring the 
abilities of all the pupils, but purely at selecting a certain proportion 
of them. For instance, the English secondary school selection exami- 
nation attempts to pick out the best 20% or so of candidates for 
grammar school education. The corresponding Scottish examination 
has roughly the opposite purpose, namely to eliminate the poorest 
pupils who are unfit for advanced secondary school work. Mowat 
(1938) points out that the most appropriate examinations or tests 
in these two instances will be ones which will spread out, or discrimi- 
nate between, candidates whose marks are close to the respective 
borderlines. The former then should yield a positively skewed 
distribution and give plenty of headroom, while the latter should 
yield a negatively skewed distribution and give plenty of footroom. 
The marks of the bottom three-quarters of candidates in the former, 
and of the top three-quarters in the latter, scarcely matter. Hence, 
it might be a waste of time to construct an examination which would 
spread them all out in the proper normal fashion.^ 

^ Nevertheless, attainments tests which cover the whole range are commonly 
employed, partly because the position of the borderline varies widely from one 
area to another, and partly because it is useful to have accurate educational 
quotients to assist in allocating pupils to appropriate streams in any secondary 
school they ultimately attend. 
Next we can distinguish examinations which are constantly set in
schools for such pedagogic purposes as ensuring that all, or practically
all, the pupils have successfully accomplished a certain stage
in their work, or that they have learnt, say, some French grammar, 
chemical formulae, or history notes, set to them for preparation. 
Other exercises are needed, notably in mathematical subjects, for 
practice in the operations which the pupils are in process of acquiring. 
In these instances we may expect distributions to show a strong 
negative skew. The tests will naturally be designed so that the major- 
ity can get full or nearly full marks. The weaklings will be somewhat 
more spread out, but even the poorest may be able to attain half 
marks. Doubtless it is because these abnormal distributions have 
become habitual in classwork that most teachers are so blind to the 
defects in their examination marking. Such tests which are used as 
pedagogic aids should be kept quite distinct from tests or examina- 
tions designed for purposes of measuring abilities. When the object 
is to compare different pupils or to find how able a pupil may be in
different subjects, then they should certainly be constructed and 
marked according to the principles outlined below. Further, when a 
standardized (published) test of intelligence or attainment is to be 
given to a class or to a group of students, it should always be chosen 
so as to yield a proper frequency distribution (cf. pp. 188-9). 

Rationalization of Marking. — Marking will probably never become
scientific so long as it is confused with value judgments. Thus the 
first point to realize is that marking is a process of differentiating 
between the various levels of ability existent among the scripts which 
are being marked. Either it is a process of arranging them in rank 
order or of assigning them to one of a limited number of qualitative 
or numerical grades. But it is not a process of deciding whether all, 
or some, of the scripts are ‘good’ or ‘bad,’ whether the pupils ‘ought
to have done better’ or ‘have tried hard,’ whether they ‘pass’ or ‘fail.’ 
All such decisions should be postponed until the actual marking is 
completed. 

Often it is desirable that the examination should be of the new- or 
objective-type. The reasons for this advice appear in a later chapter, 
but one of the main ones is that marking is greatly simplified and 
rendered far fairer. In such examinations difficulties arise, not in the 
scoring (which is purely of the count type), but in the setting. They 
should be of such a length, and contain questions of sufficient ease 
and difficulty, to differentiate among both the poorest and the best 
pupils, and to yield an approximately normal distribution. Ideally
the average pupil in the class, or the average examinee, should get as
near as possible to half marks, the poorest pupil little better than
zero, and the best pupil little less than full marks. For if no examinee 
scores less than, say, 40%, then it is obvious that the easy questions 
at the beginning are a waste of everybody’s time. And if no one scores 
more than, say, 70%, the most difficult questions at the end are also 
wasted. 

Humanitarian Objections . — An immediate, and quite pardonable, 
objection will be raised against this ideal, namely, that if marks range 
from nearly 0% to nearly 100%, the poorer pupils will become 
greatly discouraged when they habitually obtain marks of around 
10% to 20%, and their parents will become bitterly resentful. Some 
teachers are so conscious of the stimulating effects of success and 
the depressing effects of failure, that they deliberately distort marks,
giving the duller pupils who have tried hard more than their work
deserves, and the brighter ones who have not done so well as they
might less than they deserve. It may be that this kindheartedness 
of teachers towards the dullards has been partly responsible for 
forcing up their distributions of marks into the absurd shapes 
typified by Ex. 5 above. Perhaps equally potent has been their fear 
of the criticisms of head teachers and inspectors if they admitted 
their failure to teach the almost ineducable pupils by marking their 
work on its true merits. Worthy as such humanitarian objections 
may be, they do not constitute an excuse for bad test construction or 
bad marking. But the solution is not difficult. We shall see in Chap. 
IV that the actual average, and range, or the scale of marks, em- 
ployed do not matter in the slightest, provided that everyone agrees 
to use the same scale. Either, then, the test may contain sufficient 
(useless) questions which are easy enough to encourage the poor 
pupils; or the test may be rationally constructed, as indicated above,
so as to yield a range of, say, 4 to 26 out of 30, or 10 to 70 out of 
80; and when the marking is complete the marks can readily be 
converted into some more acceptable and conventional scale, such 
as 40% to 90%, by the methods described in Chap. IV. A third 
alternative is to withhold all numerical grades from pupils and 
parents, since they are so liable to misinterpretation, and instead, 
once the scientific marking is done and is recorded for administra- 
tive purposes, to reconsider the scripts and attach any verbal or letter 
labels to them that appear suitable for pupil and parental consumption. 
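
The conversion methods themselves belong to Chap. IV and are not reproduced here. Purely as an illustration, and assuming a simple straight-line conversion (which may not be the method that chapter recommends), a raw range of 4 to 26 can be mapped onto 40% to 90% as sketched below. The function name `rescale` is hypothetical.

```python
def rescale(mark, old_low, old_high, new_low, new_high):
    """Map `mark` linearly from [old_low, old_high] onto [new_low, new_high].

    Assumption: a straight-line conversion, used here only for illustration.
    """
    fraction = (mark - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


# Raw marks out of 30 that actually ranged from 4 to 26.
raw_marks = [4, 10, 15, 20, 26]
converted = [round(rescale(m, 4, 26, 40, 90)) for m in raw_marks]
print(converted)
```

The rank order of the pupils is untouched by such a conversion; only the labels attached to the levels change.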

Grading of Complex Educational Products. — Certain types of work
such as English composition and handwriting usually have to be 
graded on the basis of total subjective impression. The same is true 
of many examination scripts when the answers consist of a series of 
historical, scientific, or other essays. Here all the answers to any one 
question should invariably be marked before the examiner proceeds 
to another question. Whenever there are about twenty or more such 
essays (or other educational products) to be dealt with, resort should 
be made to the quality or product scale method (cf. p. 154). The 
marker should first skim through a fair number of the scripts until 
he has gained a rough impression of their average level and of their 
extreme range of attainment. No account whatever should be taken 
of his feelings that this level is laudably good or deplorably bad. 
What is needed is the actual average of these particular scripts. 
Next he should look through them more carefully and pick out one 
which typifies the average, set it aside, and call it 5. Next the two 
should be taken which appear to be the best and worst of the bunch;
these should be called either 9 and 1, or 8 and 2. Finally, others
should be selected which seem to represent levels intermediate 
between 9 (or 8) and 5, and between 5 and 1 (or 2). These should be 
numbered (8), 7, 6, 4, 3, (2). A good deal of revision of first im- 
pressions may be required before a final decision is reached as to 
these nine (or seven) samples. Once they are chosen the marking is 
straightforward. The rest of the scripts are simply compared one by 
one with the samples and each is awarded the mark of the sample 
which it most closely resembles in merit. Should one or two turn up 
which are superior to No. 9 or inferior to No. 1, they may be given 
10 or 0. 

One obvious advantage of this plan is that it enables the marker 
to be self-consistent and to maintain the same standards throughout.
Another is that it does not demand impossibly fine discriminations. 
The number of samples can be reduced to five, or extended to eleven, 
if desired, or some scripts may be given a mark midway between two 
samples, e.g. 6½. But usually seven or nine grades are sufficient.
Most important, however, is the fact that the final distribution will 
be fairly smooth and symmetrical if the number of scripts is large 
and the samples have been well chosen. Even if their number is small 
there should be a majority of 6’s, 5’s, and 4’s, and a minority of 9’s,
8’s, 2’s, and 1’s; and there should be about as many above as below
5. The precise marks given to the samples do not matter. They may,
for example, equally well be called 90%, 85%, 80% . . . 55%, 50%,
so long as the marker resolutely avoids all temptations to withhold
90% on the irrelevant grounds that the best scripts in the bunch do
not seem to him to be ‘up to 90% standard.’ It is, however, probably 
better to do the marking of each question on a 0-10 scale. Such 
marks for several questions in an examination can legitimately be 
added. The final totals can then be converted into any desired con- 
ventional scale, just as can the scores from an objective test. 

Other Types of Marking. — It is more difficult to give definite
recommendations as to the marking of work of mixed type, where 
only part of the merit or demerit resides in points which can be 
counted. In foreign language examinations, history, science, etc., 
both correct or incorrect facts, and general style, or originality and 
the like, may need assessment. Sometimes these quantitative and 
qualitative features can be graded separately and later combined in 
due proportions. It might be better if, as is suggested in Chap. XIV, 
such papers were not set; either they should be definitely new-type,
or definitely of the qualitative essay-type. Nevertheless, the main 
requisites of good marking should always be attainable, namely a 
similar distribution and similar average level of marks in all subjects, 
or in all the questions set in any one subject, and an approximation 
of this distribution to normality. 

Still more difficult are the problems of the marker who has only a 
few scripts with which to deal, e.g. a university external examiner 
who is called in to assess half a dozen degree candidates. Even if the 
papers consisted of perfectly constructed new-type tests, they would 
inevitably yield extremely irregular and variable distributions. The 
examiner is, therefore, forced to employ subjective recollections of 
the merits of similar candidates whom he has marked in the past. 
Nevertheless, he should keep the normal curve in mind, for in the long 
run all the scripts that he meets are likely to be normally distributed. 
He should try, therefore, to build up definite mental specifications 
of his 9, 8, 7 ... 2, 1 levels, and attempt, by thorough discussion 
with the internal examiners, to decide where on this scale the First 
Class, Pass, or other standards fall. 

Some additional points bearing on these plans, and the answers 
to some of the possible criticisms, will be discussed when we have 
studied, in the next chapter, the statistical conceptions of percentiles, 
averages, and standard deviations. 



CHAPTER III 


STATISTICAL METHODS FOR HANDLING 
FREQUENCY DISTRIBUTIONS 

Numerical Description of Distributions . — So far we have dealt with 
various types of distributions of measures almost wholly in verbal 
terms, and have compared one distribution with another mainly ‘by 
eye.’ More exact study of them necessitates quantitative description 
and definition. We need first a numerical expression for the ‘general 
run’ of the measures, and, secondly, some index of the range or 
spread of measures. For instance, if we want to know how intelligent 
Johnny Smith is, and ask a psychologist, he will tell us Johnny’s 
score on an intelligence test and, in order to enable us to interpret 
this score, he will add that the normal score for children of Johnny’s 
age is so and so, and that Johnny’s score falls at such and such a level 
in the distribution. Again, if we ask a teacher what are the heights 
of the children in her class, she might in reply show us a complete 
table of the distribution of heights, or she might tell us with greater 
precision that the average was 50 inches and the range from 42 to 
57 inches. 

Actually the range is not a good index of the way the measures 
are spread out above and below the average, since it depends on only
two of the measures, those obtained by the extreme pupils who were 
measured. For example, although this class contains children ranging 
from 42 to 57 inches, all the rest might lie between 46 and 54 inches, 
and we would hardly get a fair picture of the spread if we were only 
told that the total range is 15 inches. A much better index is called 
the Standard Deviation (often abbreviated to S.D. or σ), which will
be described below. 

A distribution of measures can then generally be defined numeri- 
cally by means of its average and standard deviation. But we shall 
describe first an alternative system which possesses certain practical 
advantages. This is based on what are known as the median, quar- 
tiles, and deciles or percentiles. These terms all refer to certain levels 
or fixed points in a distribution. Suppose the measures to be tabu- 
lated or arranged in order, then the median is the middle measure, 
i.e. the measure half-way up counting from the bottom, or half-way
down counting from the top. The lower and upper quartiles, usually
referred to as Q1 and Q3, are the measures one-quarter and
three-quarters way up. The decile measures are one-tenth, two-tenths, etc.,
and the percentiles one-hundredth, two-hundredths, etc., of the way 
up. Stated in a different way, the Xth percentile is the measure below 
which X% of all the measures fall. 

Determining the Median. — Ex. 6. — Consider the following distri- 
bution of only five measures. Clearly the median is 15, that is the 
third measure up from the bottom or down from the top. 

19 

17 

15 

13 

11 

From this example it may be seen that the median is the
$\frac{N+1}{2}$th measure, not the $\frac{N}{2}$th.

Ex. 7. — In the following distributions, N = 8, hence
$\frac{N+1}{2} = 4\tfrac{1}{2}$.


9 10 19 

8 9 17 

8 8 14 

7 7 12 

7 6 9 

6 5 5 

6 4 4 

5 3 1 


Thus the median is half-way between the fourth and fifth measures 
from the bottom. In the first distribution the fourth and fifth meas- 
ures are both 7, hence the median is 7. In the second, the fourth and 
fifth measures are 6 and 7, hence we have to take 6½ as the median.
In the third distribution the data are too sparse for an accurate
median to be determined. The best we can do is to interpolate and
call it 10½.
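
The rule used in Ex. 6 and Ex. 7 can be sketched in a few lines. This is an illustrative sketch only, and the function name `median` is our own choice: it takes the middle measure when N is odd, and the point half-way between the two middle measures when N is even.

```python
def median(measures):
    """Median as the (N + 1)/2-th measure counting from the bottom.

    When N is even, (N + 1)/2 falls half-way between two measures,
    so the mean of those two middle measures is returned.
    """
    ordered = sorted(measures)
    n = len(ordered)
    lower_index = (n - 1) // 2   # 0-based index of the lower middle measure
    upper_index = n // 2         # 0-based index of the upper middle measure
    return (ordered[lower_index] + ordered[upper_index]) / 2


print(median([19, 17, 15, 13, 11]))          # Ex. 6
print(median([9, 8, 8, 7, 7, 6, 6, 5]))      # Ex. 7, first distribution
print(median([10, 9, 8, 7, 6, 5, 4, 3]))     # Ex. 7, second distribution
print(median([19, 17, 14, 12, 9, 5, 4, 1]))  # Ex. 7, third distribution
```

The four calls reproduce the medians found above, namely 15, 7, 6½, and 10½.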


Determining the Quartiles and Percentiles. — The quartiles are
similarly obtained by counting the measures which are $\frac{N+1}{4}$
and $\frac{3(N+1)}{4}$ from the bottom. For the distribution given in
Ex. 1 (p. 5), N = 50, hence the required measures are 12¾ and 38¼ up
the list. The 11th, 12th, and 13th measures are 9, the 37th to 43rd
measures are all 17, hence Q1 = 9, Q3 = 17.

Of the percentiles, the most useful are the 100th or 99th, 90th,
75th, 50th, 25th, 10th, 1st, or 0th. The 100th and 0th are usually
the top and bottom measures in the distribution, respectively. The
75th and 25th are the quartiles, the 50th is the median. The 90th
percentile or 9th decile is the measure which is $\frac{9(N+1)}{10}$
from the bottom.
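
The counting rule for quartiles and percentiles of untabulated measures may be sketched as follows. The names `expand` and `percentile` are hypothetical, the frequencies are those of Ex. 1 as tabulated in Ex. 9, and the treatment of fractional positions by straight-line interpolation between neighbouring measures is an assumption made for the sketch.

```python
# Frequency table of Ex. 1 (as tabulated in Ex. 9): measure -> number of cases.
ex1_table = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
             10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1}


def expand(table):
    """Turn a frequency table into a sorted list of individual measures."""
    return sorted(x for x, f in table.items() for _ in range(f))


def percentile(measures, p):
    """Measure lying p(N + 1)/100 of the way up, counting from the bottom.

    Fractional positions are resolved by linear interpolation (an assumption);
    positions beyond either end are clamped, so that the 0th and 100th
    percentiles are the bottom and top measures.
    """
    ordered = sorted(measures)
    n = len(ordered)
    position = min(max(p * (n + 1) / 100, 1), n)  # 1-based position in the list
    whole = int(position)
    fraction = position - whole
    low = ordered[whole - 1]
    high = ordered[min(whole, n - 1)]
    return low + fraction * (high - low)


measures = expand(ex1_table)
for p in (25, 50, 75, 90):
    print(p, percentile(measures, p))
```

For this distribution the sketch gives Q1 = 9 and Q3 = 17, in agreement with the values found by counting.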

Determining Percentiles from Tabulated Data. — When dealing with
measures which have been grouped into classes, these various levels
are likely to fall somewhere in the middle of a class. For instance, in
Ex. 3 (p. 10), the median of 250 measures is the 125½th, which lies
somewhere within the class 98-102. Its position is found by
interpolation, and the following formula may be used.

$$X_p = L + \frac{C}{F}\left(\frac{PN}{100} - S\right)$$

P = the percentile.

$X_p$ = the measure falling at this percentile level, which we wish to
find.

L = the lower limit of the class, somewhere within which $X_p$ lies.

S = the sum of all the frequencies up to, but not including, this
class.

F = the frequency within this class.

N = the total of all the frequencies.

C = the size of the class.

Ex. 8. — To find the median of the distribution given in Ex. 3,
substitute in this formula:

P = 50. L, the lower limit, = 98. S = 3 + 10 + 18 + 27 + 43 = 101.
F = 47. N = 250. C = 5.

$$X_p = 98 + \frac{5}{47}\left(\frac{50 \times 250}{100} - 101\right) = 100.5 \text{ (to the first place of decimals).}$$

By applying the same formula, we find the 99th, 90th, 75th, 25th,
10th, and 1st percentiles to be 128, 117, 109, 94, 86, and 77
respectively (omitting decimals).
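
The interpolation of Ex. 8 can be sketched as a short routine. The function name `grouped_percentile` is hypothetical; the class limits and frequencies are those of Ex. 3, and the routine simply finds the class containing the required level and applies L + (C/F)(PN/100 − S).

```python
# Ex. 3: (lower limit of class, frequency), lowest class first; class size C = 5.
ex3_classes = [(73, 3), (78, 10), (83, 18), (88, 27), (93, 43), (98, 47), (103, 35),
               (108, 26), (113, 22), (118, 11), (123, 6), (128, 1), (133, 1)]


def grouped_percentile(p, classes, class_size):
    """Measure at the p-th percentile of a grouped distribution, by interpolation."""
    n = sum(freq for _, freq in classes)
    target = p * n / 100          # PN/100: how many cases must lie below the level
    below = 0                     # S: frequencies in the classes already passed
    for lower_limit, freq in classes:
        if freq > 0 and below + freq >= target:
            return lower_limit + class_size / freq * (target - below)
        below += freq
    raise ValueError("p must lie between 0 and 100")


for p in (99, 90, 75, 50, 25, 10, 1):
    print(p, round(grouped_percentile(p, ex3_classes, 5), 2))
```

The median comes out at about 100.55, and the other six levels, taken to the nearest whole number, agree with the values 128, 117, 109, 94, 86, and 77 quoted above.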

Comparison of Median and Average. — If the distribution is a
perfectly smooth and normal one, the median is identical with the
average. In any fairly regular and symmetrical distribution they will
be quite close to one another. Thus in Ex. 8, where the median is
100.5, the average happens to be 100.68. For rough work the median
is often used as an approximation to the average, since its computation
is far easier.' Sometimes, indeed, it gives a better notion of the
‘general run’ of the measures than does the average. For example, 
the average age of a school class may be unduly raised because the 
class includes two or three pupils much older than all the rest. They 
cannot very well be omitted in calculating the average, but if the 
median is found, they no longer exert a disproportionate effect upon 
the result. Similarly, if a free word association test is applied to a 
child at a psychological clinic, most of the child’s responses may be 
given within two to three seconds, but some of them may take much 
longer, and a few words may evoke no response at all. In such a case 
the average cannot be determined, and the median is the most 
reasonable measure of the child’s speed of response. 
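
The effect of a few extreme cases on the average, and their lack of effect on the median, can be seen in a small sketch. The ages below are invented for illustration and are not taken from any class described in the text.

```python
from statistics import mean, median

# Hypothetical ages (in years) of a class of ten pupils, two of them much older.
ages = [11, 11, 11, 12, 12, 12, 12, 12, 15, 16]

print("average:", mean(ages))
print("median: ", median(ages))

# Replacing the two oldest pupils by pupils of 12 changes the average
# but leaves the median exactly where it was.
typical = ages[:8] + [12, 12]
print("average without the older pupils:", mean(typical))
print("median without the older pupils: ", median(typical))
```

Here the two older pupils raise the average from 11.7 to 12.4, while the median stays at 12.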

Here is another instance where the median may be very useful.
Suppose an investigator wishes to find the approximate average
speed with which a group of pupils or students can perform a certain
task, such as handwriting. There is no need for him to time each 
person separately, nor to count the number of words each one has 
written in a certain time, and then add up and average the results. 
Instead, he may set the whole class writing a standard passage, and 
tell them to hold up their hands directly they finish. Then he may 
count the hands raised, and stop the test as soon as the median 
person raises his hand, noting the time at which this occurs. The 
median speed of writing for a class of, say, 41 persons can be calcu- 
lated from the time taken by the 21st quickest person. 

When a distribution is strongly skewed, the median differs 
appreciably from the average, being larger if the skew is negative, 
smaller if it is positive. In Ex. 1 (p. 14), the median is 14, the average 
12.86. This difference is often used as a measure of skewness (the
formula can be found in more advanced textbooks). 

' The main disadvantage of medians, and also of percentiles, is that they cannot
readily be manipulated statistically. For instance, given the medians of
distributions A and B, we cannot determine from them the median of the combined
distribution, A + B; whereas the averages of A and of B will give us the average
of A + B.

Uses of Percentiles. — The percentile levels in a distribution clearly
provide us with as complete a numerical description of that
distribution as we may need. For example, the figures quoted in Ex. 8 for the
distribution of students’ scores on the Cattell Intelligence Test (the
99th, 90th, 75th, 50th, 25th, 10th, and 1st percentiles) give a
reasonably full answer to the question: how did the students do on the
test? Knowing these levels, we can begin to interpret the significance
of any one person’s score, since we can tell how he stands relative to
this group of students. Thus a score of 92 on the Cattell test would
not be a very good one for a training college student. It falls at about
the 20th percentile, which means that 80% of the students did better
at the test, only 20% did worse. But when we are told that the same
score falls at the 94th percentile for normal adults (not students), we
realize that it does represent quite high ability. The applications of
percentiles to the interpretation of school marks and to test norms
will be considered later.
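
The reverse operation, finding the percentile level at which a given score falls, can be sketched in the same way. The function name `percentile_rank` is hypothetical, the frequencies are those of Ex. 3, and it is assumed that the measures are spread evenly within each class.

```python
# Ex. 3: (lower limit of class, frequency), lowest class first; class size C = 5.
ex3_classes = [(73, 3), (78, 10), (83, 18), (88, 27), (93, 43), (98, 47), (103, 35),
               (108, 26), (113, 22), (118, 11), (123, 6), (128, 1), (133, 1)]


def percentile_rank(score, classes, class_size):
    """Approximate percentage of measures falling below `score`.

    Assumes the measures are spread evenly within each class.
    """
    n = sum(freq for _, freq in classes)
    below = 0.0
    for lower_limit, freq in classes:
        if score >= lower_limit + class_size:
            below += freq                                        # whole class lies below
        elif score > lower_limit:
            below += freq * (score - lower_limit) / class_size   # part of the class
    return 100 * below / n


print(round(percentile_rank(92, ex3_classes, 5), 1))
```

For a score of 92 this gives a level of about 21, which agrees with the statement that the score falls at about the 20th percentile for the students.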

Inter-quartile Range. — The range of scores from the 25th to the
75th percentiles (Q1 to Q3) is often called the inter-quartile range.
It tells us what the middle 50% of the group accomplished. Half of
this range is often referred to as Q, the semi-interquartile range, and
this is used as a measure of the degree of spread or dispersion of the
distribution. When the distribution is smooth and normal, Q is
usually equal to about one-eighth of the total range of measures.
In Ex. 3 (p. 10) the total range is 61, and
$Q = \frac{109 - 94}{2} = 7\tfrac{1}{2}$, which is one-eighth of 60.

CALCULATING THE AVERAGE OR MEAN 

More accurate statistical treatment of distributions demands the 
use of the average and standard deviation in place of the median 
and Q. Everyone knows how to calculate an average or arithmetic 
mean (often referred to as ‘the mean’), but most people use an
unnecessarily complex method, which is very liable to lead to mis- 
takes. We will therefore outline this and the shorter alternative 
methods. 

(1) The ordinary method consists in adding up all the measures
and dividing by the number of cases. Expressed as an algebraic
formula, $M = \frac{\Sigma X}{N}$, where M = the mean or average,
X = any one measure, and Σ = ‘the sum of all the ...’; hence ΣX
signifies the sum
of all the separate measures. As a rough general rule, this method
may be used when N does not amount to more than about 20. With 
larger numbers it becomes very inefficient, unless an adding machine 
is available. 

(2) Often it is quicker to tabulate the measures, to multiply each
X by its F, and then to sum. The formula now becomes
$M = \frac{\Sigma(FX)}{N}$.

Ex. 9. — The method is here applied to the distribution already 
given in Ex. 1. 


  X      F      FX
 20      2      40
 19      2      38
 18      3      54
 17      7     119
 16      4      64
 15      6      90
 14      3      42
 13      4      52
 12      2      24
 11      2      22
 10      2      20
  9      3      27
  8      2      16
  7      1       7
  6      2      12
  5      1       5
  4      2       8
  3      0       0
  2      1       2
  1      1       1

      N = 50   Σ(FX) = 643

$$M = \frac{\Sigma(FX)}{N} = \frac{643}{50} = 12.86$$


(3) The process is shortened still further if the measures are
tabulated in classes. If x is the midpoint of each class, then
$M = \frac{\Sigma(Fx)}{N}$.

Ex. 10. — The method is here applied to the distribution given in 
Ex. 2. 


X (class)   x (midpoint)    F      Fx
 18-20          19           7     133
 15-17          16          17     272
 12-14          13           9     117
  9-11          10           7      70
  6-8            7           5      35
  3-5            4           3      12
  0-2            1           2       2

                         N = 50   Σ(Fx) = 641

$$M = \frac{\Sigma(Fx)}{N} = \frac{641}{50} = 12.82$$


Note that this value of M, 12.82, is not precisely identical with
that obtained by the previous method, 12.86, though the discrepancy
is only 0.3%. The reason is, of course, that the figures are somewhat
distorted by being grouped into classes. We are regarding the 20’s 
and the 18’s as 19’s, the 17’s and 15’s as 16’s, and so on. In the long 
run these distortions tend to cancel one another out, and therefore 
have very little effect upon the final result. They only become serious 
if the grouping is too coarse, i.e. if the number of classes in the table 
is too small. 
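
Methods (2) and (3) amount to the same computation applied to two different tables, as the following sketch shows. The function name `mean_from_table` is our own; the frequencies are those of Ex. 9 and Ex. 10.

```python
# Ex. 9: each measure X with its frequency F.
ungrouped = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
             10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1}

# Ex. 10: the same measures grouped in classes of three, keyed by class midpoint x.
grouped = {19: 7, 16: 17, 13: 9, 10: 7, 7: 5, 4: 3, 1: 2}


def mean_from_table(table):
    """Mean of a frequency table: sum of F times X, divided by N."""
    n = sum(table.values())
    return sum(freq * value for value, freq in table.items()) / n


exact = mean_from_table(ungrouped)
approximate = mean_from_table(grouped)
print(f"ungrouped mean: {exact:.2f}")
print(f"grouped mean:   {approximate:.2f}")
print(f"discrepancy:    {100 * (exact - approximate) / exact:.1f}%")
```

The two results are 12.86 and 12.82, and the discrepancy due to grouping is about 0.3%.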

(4) The amount of computation, when large numbers are involved, 
may be much reduced by applying the device of the arbitrary or 
guessed mean. The following illustration may explain it. Suppose we 
wish to find the average height of a class of children, we could first
of all measure each child’s height from the ground to the top of his
head, add up the figures and divide by N. An alternative plan, which 
would do equally well, would be to place each child by a 3-foot table, 
and measure the height from this to the top of his head. We would 
average the measures as before, but would add 36 inches to our 
final result. Again, we might take a still higher table, say 50 inches,
and measure from this up to the tops of the heads of the larger 
children, down to the heads of the smaller ones. If we added together 
the former, subtracted the latter, divided by N, and finally added 
50 inches, we should, of course, get exactly the same result as before. 
But the figures with which we should be dealing would be far smaller
than in the first instance. We would be averaging, e.g. +6, +1,
−2, +3, −5, etc., instead of 56, 51, 48, 53, 45, etc.

The method consists, then, in making a rough guess at the mean, 
listing the amounts by which each measure deviates from (is greater 
or less than) this arbitrary mean, averaging the deviations, and 
finally adding the average to the arbitrary mean. 


$$M = \frac{\Sigma(FD)}{N} + A.$$


A is the arbitrary or guessed mean, and D the difference between 
each X and A. 


Ex. 11. — The method is here applied to the distribution given in
Ex. 9 (p. 41), and it yields precisely the same result for M as was
obtained by the ordinary method of averaging.

  X       D       F      FD
 20      +7       2      14
 19      +6       2      12
 18      +5       3      15
 17      +4       7      28
 16      +3       4      12
 15      +2       6      12
 14      +1       3       3
 13       0       4       0     (sum of plus quantities = +96)
 12      −1       2       2
 11      −2       2       4
 10      −3       2       6
  9      −4       3      12
  8      −5       2      10
  7      −6       1       6
  6      −7       2      14
  5      −8       1       8
  4      −9       2      18
  3     −10       0       0
  2     −11       1      11
  1     −12       1      12     (sum of minus quantities = −103)

              N = 50

A = 13.

$$M = \frac{96 - 103}{50} + 13 = -0.14 + 13 = 12.86$$

Note that in this example $\frac{\Sigma(FD)}{N}$ is a minus quantity,
and is therefore subtracted from A. If we had chosen A = 12,
$\frac{\Sigma(FD)}{N}$ would have been a plus quantity, namely +0.86.

(5) By far the most efficient method when N is large, say 100 or
more, is to average, not the actual amounts, but the numbers of
classes by which the tabulated measures exceed or fall short of A.
For instance, suppose we have a table of the distribution of ages of
all the inhabitants of a town, grouped into classes of ten years, with
midpoints of the classes at 5, 15, 25, etc., years. Wishing to calculate
the average age we take 35 as our arbitrary mean. Obviously there is
no need to sum the numbers who are 10, 20, 30, . . . years more or less
than this A. We might just as well sum those who are 1, 2, 3, . . .
decades more or less. Then, when we have calculated the average
deviation from A in decades, we must multiply it by 10 to turn it
back into years, before adding it to, or subtracting it from, A.

The formula for this method is $M = c\,\frac{\sum (Fd)}{N} + A$, where d is
the number of the class above or below that class of which A is the
midpoint, and c is the size of each class.

Ex. 12. — The method is here applied to the distribution given in
Ex. 3 (p. 10).

       X          d       F      Fd
    133-137     + 6       1       6
    128-132     + 5       1       5
    123-127     + 4       6      24
    118-122     + 3      11      33
    113-117     + 2      22      44
    108-112     + 1      26      26
    103-107       0      35            (sum of plus Fd = + 138)
     98-102     − 1      47      47
     93-97      − 2      43      86
     88-92      − 3      27      81
     83-87      − 4      18      72
     78-82      − 5      10      50
     73-77      − 6       3      18
                                 (sum of minus Fd = − 354)
                     N = 250

The arbitrary mean is taken as the midpoint of the 103-107 class,
i.e. as 105. Classes above and below this are numbered consecutively.

$$\frac{\sum (Fd)}{N} = \frac{+138 - 354}{250} = -0.864$$

This is the average number of classes by which M differs from A.
We must multiply by 5 to get the average deviation in scores, since
c = 5.

M = 105 − 5 × 0.864 = 100.68
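
As a check on the arithmetic, the class-deviation method can be sketched in Python. The frequencies and class midpoints are those of Ex. 12; the function name `grouped_mean` and its parameter names are illustrative assumptions.

```python
def grouped_mean(midpoints, freqs, guess, width):
    """M = A + c * sum(F * d) / N, with d counted in whole classes from A."""
    n = sum(freqs)
    steps = [round((m - guess) / width) for m in midpoints]   # d for each class
    mean_step = sum(f * d for f, d in zip(freqs, steps)) / n  # sum(Fd) / N
    return guess + width * mean_step

midpoints = list(range(135, 70, -5))     # 135, 130, ..., 75 (class midpoints)
freqs = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
mean_ex12 = grouped_mean(midpoints, freqs, guess=105, width=5)

print(round(mean_ex12, 2))
```

The sketch treats every measure in a class as if it lay at the class midpoint, which is the same assumption the hand method makes.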

Once the student is accustomed to this method he will find it far
quicker and more accurate than any other. If in this example he had
used one of the first three methods, he would have been involved in
a sum totalling about 25,000. Points which require special caution
are: (i) making sure what measures are included in a class, and what
is the size of c; (ii) determining A correctly; (iii) multiplying
$\frac{\sum (Fd)}{N}$ by c before adding it to A; (iv) making sure of the sign
of this quantity before adding it to, or subtracting it from, A.

THE MEAN VARIATION 

The Mean Variation or Average Deviation (M.V. or A.D.) is 
occasionally employed as an index of the extent to which the meas- 
ures are spread out on either side of the mean. As its name indicates, 
it is the average of all the differences between each measure and the 
mean (or sometimes the median), regardless of whether these differ- 
ences are plus or minus. 

Ex. 13. — In this very brief distribution, the mean is 5.6. The D
column lists the deviations. $\text{M.V.} = \frac{\sum D}{N} = \frac{10.4}{5} = 2.08$.
If the data were in the form of a frequency distribution table, the
same method would apply, but the formula would be
$\text{M.V.} = \frac{\sum (FD)}{N}$.

       X        D
       9       3.4
       7       1.4
       6       0.4
       4       1.6
       2       3.6

   N = 5      10.4
   M = 5.6

Clearly the M.V. will be larger the more widely dispersed or 
spread out the measures. Its value is a little greater than that of Q, 
being generally about one-seventh of the total range of measures. 
(This does not hold in Ex. 13 owing to the shortness and irregularity 
of the distribution.) The M.V. happens to be of very little use for 
purposes of further statistical treatment, hence it is usually passed 
over in favour of the Standard Deviation. 
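
A minimal Python sketch of the Mean Variation, using the five measures of Ex. 13, may make the definition concrete. The function names are illustrative assumptions; the grouped version simply weights each absolute deviation by its frequency.

```python
def mean_variation(measures):
    """Average of the deviations from the mean, signs disregarded."""
    n = len(measures)
    mean = sum(measures) / n
    return sum(abs(x - mean) for x in measures) / n

def mean_variation_grouped(values, freqs):
    """The same index for a frequency table: sum(F * |D|) / N."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

mv_ex13 = mean_variation([9, 7, 6, 4, 2])
mv_as_table = mean_variation_grouped([9, 7, 6, 4, 2], [1, 1, 1, 1, 1])

print(round(mv_ex13, 2), round(mv_as_table, 2))
```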


THE STANDARD DEVIATION 

The S.D. is also a kind of average of all the deviations from the 
mean, but it differs from the M.V. in that the deviations are squared 
before they are summed, and the square root of their average is 
taken. 


$$\text{S.D. or } \sigma = \sqrt{\frac{\sum D^2}{N}}$$

Ex. 14. — The working is here shown for the same short distribution
as in Ex. 13.

       X        D        D²
       9       3.4     11.56
       7       1.4      1.96
       6       0.4      0.16
       4       1.6      2.56
       2       3.6     12.96

   N = 5               29.20
   M = 5.6

$\frac{\sum D^2}{N} = \frac{29.20}{5} = 5.84$; $\sigma = \sqrt{5.84} = 2.42$

As the mean is seldom a whole number, the squaring of deviations
from it is troublesome. Generally it is simpler to list the deviations
from some arbitrary mean which is a whole number, and then to
apply a correction according to the following formula:

$$\sigma^2 = \frac{\sum D^2}{N} - (M - A)^2$$

Note that the correction is invariably subtracted, whether its sign
before squaring is positive or negative.

Ex. 15. — In the same distribution, let A = 6, i.e. the nearest whole
number to M.

       X        D        D²
       9      + 3        9
       7      + 1        1
       6        0        0
       4      − 2        4
       2      − 4       16

                        30

$\frac{\sum D^2}{N} = \frac{30}{5} = 6.0$
$(M - A)^2 = (5.6 - 6)^2 = 0.16$
$\sigma^2 = 6.0 - 0.16 = 5.84$
$\sigma = 2.42$

The same formula may be extended to several different types of
distributions.

(a) When the distribution is a short one, and all the measures are
whole numbers, A may be chosen at zero. The D’s will then be the
original X’s, so that:

$$\sigma = \sqrt{\frac{\sum X^2}{N} - M^2}$$

Ex. 16. —

       X        X²
       9        81
       7        49
       6        36
       4        16
       2         4

               186

$\frac{\sum X^2}{N} = \frac{186}{5} = 37.2$; $M^2 = 31.36$
$\sigma^2 = 37.2 - 31.36 = 5.84$
$\sigma = 2.42$

This method is normally applied when a calculating machine is
available, since the machine is not— like the human mind— liable to 
errors when coping with large numbers. 

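(b) Often the mean is not known, but is calculated at the same
time as the S.D. Now $M = \frac{\sum D}{N} + A$. Hence an alternative version
of the same formula is:

$$\sigma = \sqrt{\frac{\sum D^2}{N} - \left(\frac{\sum D}{N}\right)^2}$$

(c) When the measures are grouped into classes, the formula
becomes:

$$\sigma = c\,\sqrt{\frac{\sum (Fd^2)}{N} - \left(\frac{\sum (Fd)}{N}\right)^2}$$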

Here c is the size of each class, and d the deviation of each measure 
from an arbitrary or guessed mean in terms of number of classes. 
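
The ungrouped versions of the formula can be verified with a short Python sketch on the five measures used in Exs. 14 to 16. The function name `mean_and_sd_from_guess` is an illustrative assumption; it applies version (b), and with a guess of zero it reduces to version (a).

```python
import math

def mean_and_sd_from_guess(measures, guess=0):
    """Mean and S.D. from deviations about an arbitrary mean A (version b)."""
    n = len(measures)
    deviations = [x - guess for x in measures]           # D = X - A
    mean_dev = sum(deviations) / n                       # sum(D) / N, i.e. M - A
    mean_sq_dev = sum(d * d for d in deviations) / n     # sum(D^2) / N
    sd = math.sqrt(mean_sq_dev - mean_dev ** 2)          # correction always subtracted
    return guess + mean_dev, sd

measures = [9, 7, 6, 4, 2]
mean_a6, sd_a6 = mean_and_sd_from_guess(measures, guess=6)   # as in Ex. 15
mean_a0, sd_a0 = mean_and_sd_from_guess(measures, guess=0)   # as in Ex. 16

print(round(mean_a6, 2), round(sd_a6, 2), round(sd_a0, 2))
```

Both choices of A lead to the same mean and the same S.D., which is the point of the correction term.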


Ex. 17. — The formula is here applied to the distribution whose
mean was calculated in Ex. 12. It was found there that
$\frac{\sum (Fd)}{N} = -0.864$. Hence the correction
$\left(\frac{\sum (Fd)}{N}\right)^2 = 0.746$.

      X        F       d      d²     Fd²
    133 +      1     + 6      36      36
    128 +      1     + 5      25      25
    123 +      6     + 4      16      96
    118 +     11     + 3       9      99
    113 +     22     + 2       4      88
    108 +     26     + 1       1      26
    103 +     35       0       0       0
     98 +     47     − 1       1      47
     93 +     43     − 2       4     172
     88 +     27     − 3       9     243
     83 +     18     − 4      16     288
     78 +     10     − 5      25     250
     73 +      3     − 6      36     108

             250                   1,478

$\frac{\sum (Fd^2)}{N} = \frac{1478}{250} = 5.912$

$\sigma = 5\sqrt{5.912 - 0.746} = 11.36$

One more instance will be given, to illustrate the calculation of the
mean and S.D. in one operation. 

Ex. 18. — The following distribution of school marks is grouped
into classes of two. Take A at the 11-12 class, i.e. at 11.5.

      X        F       d      Fd     Fd²
    19-20      2     + 4       8      32
    17-18      5     + 3      15      45
    15-16      4     + 2       8      16
    13-14      4     + 1       4       4
    11-12      3       0               0    (sum of plus Fd = + 35)
     9-10      7     − 1       7       7
     7-8       4     − 2       8      16
     5-6       4     − 3      12      36
     3-4       2     − 4       8      32
     1-2       1     − 5       5      25

   N = 36                   − 40     213

$\frac{\sum (Fd)}{N} = \frac{+35 - 40}{36} = \frac{-5}{36} = -0.139$.

This is in terms of classes. In terms of marks,
$c\,\frac{\sum (Fd)}{N} = -0.278$.

Thus M = 11.5 − 0.278 = 11.22.

$\sigma = 2\sqrt{5.917 - 0.019} \approx 4.86$

These calculations are quite quickly and easily accomplished with 
a little practice, though some care is needed in checking the arith- 
metic. For example, the whole of the above calculation, using 4- 
figure logarithms and checking by slide-rule, occupied the writer less 
than four minutes. 
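
For readers with a computer to hand, the one-operation calculation of Exs. 17 and 18 might be sketched as follows in Python. The frequencies are those of the two examples; the function name `grouped_mean_sd` and its parameters are illustrative assumptions.

```python
import math

def grouped_mean_sd(freqs, steps, guess, width):
    """Mean and S.D. of grouped data from class deviations d about A."""
    n = sum(freqs)
    mean_step = sum(f * d for f, d in zip(freqs, steps)) / n         # sum(Fd) / N
    mean_sq_step = sum(f * d * d for f, d in zip(freqs, steps)) / n  # sum(Fd^2) / N
    mean = guess + width * mean_step
    sd = width * math.sqrt(mean_sq_step - mean_step ** 2)
    return mean, sd

# Ex. 18: classes of two marks, A at the 11-12 class (11.5)
freqs18 = [2, 5, 4, 4, 3, 7, 4, 4, 2, 1]
steps18 = [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
mean18, sd18 = grouped_mean_sd(freqs18, steps18, guess=11.5, width=2)

# Ex. 17: classes of five scores, A at the 103-107 class (105)
freqs17 = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
steps17 = list(range(6, -7, -1))          # + 6 down to - 6
mean17, sd17 = grouped_mean_sd(freqs17, steps17, guess=105, width=5)

print(round(mean18, 2), round(sd18, 2), round(mean17, 2), round(sd17, 2))
```

Small differences in the last figure from hand working are to be expected, since the sketch carries more decimal places than four-figure logarithms allow.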

The S.D. is always greater than Q or the M.V. This is to be expec- 
ted, since any big deviations from the mean receive much greater 
weight in the S.D. than in the M.V. by being squared before they are 
averaged. It generally amounts to one-fifth or one-sixth of the total 
range, or sometimes more if the distribution is irregular, as in Ex. 18. 
This rule enables us to apply a rough check to our result. If in Ex. 18
we had obtained σ = 2 or σ = 10, we should have known that there
had been some mistake in the computation.

The Significance of the S.D. — It was shown earlier in this chapter
that any distribution of scores could be fairly completely defined or 
numerically described provided we knew some of the main percentile 
levels. Now if a distribution approximates to the normal shape (as it 
should do when obtained from a well-constructed test or examination), 
it can be more simply specified merely by finding the mean score and 
the standard deviation. The reason for this is that the normal curve 
is a regular shape which, like the straight line or the parabola, has 
its own algebraic equation. For instance : y = kx gives a straight line, 
running through the origin, whose slope depends on the size of k. A 
normal curve’s equation is more complicated, but actually involves 
nothing more than x, y, σ and certain constants:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-x^2 / 2\sigma^2}$$

x refers to any given score expressed as a deviation from the mean, and
y is the frequency of that score, expressed as a proportion of the total
number of cases.

This means that, if we know the S.D. of a distribution, we can 
deduce y for every value of x, that is the frequency of every possible 
measure. For instance, although we cannot measure the intelligence 
quotients of the whole population of Great Britain, yet, if we may 
assume that the I.Q. is a variable which is normally distributed, with 
a mean of 100 and a S.D. of 15, we can state at once just what pro- 
portions of the population have I.Q.s of 100, 101, . . . etc., or what 
proportion lies below 67 I.Q., or between 120 and 130 I.Q., and so on.
Furthermore, there are available in most statistical textbooks tables
which will tell us the value of y corresponding to each value of x/σ,
regardless of what the variable may be (I.Q., height, handwriting
speed, etc.). These are known as ‘probability integral’ tables. The
graph given in Fig. 18 is based on such tables, but is probably easier
to use, and is accurate to two or three significant places. Against x/σ
is plotted, not y (the actual proportion of measures which deviate
from the mean by this amount) but the proportion which deviate by
this amount or more. Consider for example I.Q. 115. Here x is + 15,
and if the S.D. is 15, then x/σ = 1. That is, 115 represents a deviation
of 1 × σ above the mean. Now the lowest curve in Fig. 18 tells us that
when x/σ = 1, y = 0.16. In other words, about 0.16 or 16% of the
population would be expected to have I.Q.s of 115 upwards. Similarly
about 16% have I.Q.s of 85 and below, since 85 represents a deviation
of − 1 × σ from the mean. If we enquire how many I.Q.s there are of
130 upwards, x/σ = + 2, and the second curve gives the answer 0.023,
or about 2.3%. Note that the same curve can be used for making
deductions about any other normally distributed variable, provided
allowance is made for the size of the S.D. The upper curves are for
dealing more accurately with larger values of x/σ and smaller values
of y. Each curve is, as it were, a 10 times magnification of a section of
the curve below.

Ex. 19. — In a certain area, with a school population of 25,000, an
intelligence test with a S.D. of 16½ was applied. How many children
may be expected to have I.Q.s of 120 to 130? And what will be the
highest I.Q. among the least intelligent 1,000 of these children?
Intelligence is something which varies continuously; that is to say,
every possible value of I.Q. in between 119 and 120 can occur,
although we usually calculate it only to the nearest whole number.
Strictly speaking, then, we want to know how many children have
I.Q.s between 119.50 and 130.50, and to exclude those whose I.Q.s
are 119.49 or less (who would count as 119’s), also those whose
I.Q.s are 130.51 or more (who would count as 131’s). Now 119.5
and 130.5 differ from the mean by 19.5 and 30.5, that is by 1.18σ and
1.85σ, taking σ as 16½. Reading off from Fig. 18, the corresponding
proportions are 0.119 and 0.0322. Hence the proportion of the
population within the given limits is 0.119 − 0.0322 = 0.0868 =
8.68%. 8.68% of 25,000 is 2,170, the answer to our first question.
In the second question, 1,000 out of 25,000 is a proportion of
0.040. From Fig. 18, 0.04 corresponds to an x value of 1.75σ.
1.75 × 16½ = 28.875. The lowest thousand therefore have I.Q.s of
100 − 28.875, or less. Hence the answer, to the nearest whole
number of I.Q., is 71. 
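
The two readings from the graph can be compared with a direct calculation. The sketch below uses `statistics.NormalDist` from the Python standard library in place of Fig. 18, and assumes, as the example does, a mean of 100 and a S.D. of 16.5; the variable names are illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16.5)

# First question: proportion of I.Q.s between 119.5 and 130.5
share_120_to_130 = iq.cdf(130.5) - iq.cdf(119.5)
expected_children = 25000 * share_120_to_130

# Second question: the I.Q. below which the lowest 1,000 of 25,000 fall
cutoff = iq.inv_cdf(1000 / 25000)

print(round(expected_children), round(cutoff))
```

The computed count comes out close to, though not identical with, the 2,170 obtained above, because the graph is read only to two or three significant places; the cutoff rounds to the same whole number of I.Q.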

Ex. 20. — Does the distribution of intelligence test scores given in
Ex. 17 (p. 47) conform to a normal distribution? In other words, are
the obtained frequencies similar to the frequencies which would be
expected for a truly normal distribution with the same mean (100.68)
and the same S.D. (11.36)?

These test scores, unlike I.Q.s, are discrete. Nobody could score
107½, however carefully his score was computed. If he nearly, but
not quite, completed 108 items, his score would still be 107. Thus
in order to find what proportion may be expected (according to the
normal curve) to fall within the 103-107 class, we must determine
the proportion with scores of 103 or more and subtract from it the
proportion with scores of 108 or more, not the proportions lying
between 102.5 and 107.5. The calculations for all classes are given
in the following table. For example, 103 is a deviation of + 2.32
from the mean, or + 0.20σ; 108 deviates + 7.32 or + 0.65σ from
the mean. From Fig. 18 we find the proportions lying at or beyond
these two limits to be 0.421 and 0.258 respectively. Hence the pro-
portion to be expected within the 103-107 class is 0.421 − 0.258 =
0.163. N is 250, and 0.163 × 250 = 41. The penultimate column
lists the expected frequencies (omitting decimals), and the last
column gives the frequencies actually obtained. The two columns
agree fairly closely, but there are some discrepancies. At present we
lack a method for telling whether these discrepancies should be
regarded as appreciable or important, but such a method will be
described later (pp. 90-1). The full answer to this question will
therefore be postponed.


                                Proportions      Proportions    F expected   F actually
                                lying at or      expected in    in each      obtained
      X          x       x/σ    beyond these     each class     class
                                values of x/σ

    133 +    + 32.32 +  + 2.85    0.00219          0.00219          1            1
    128 +    + 27.32 +  + 2.41    0.0078           0.00561          1            1
    123 +    + 22.32 +  + 1.97    0.025            0.0172           4            6
    118 +    + 17.32 +  + 1.53    0.062            0.037            9           11
    113 +    + 12.32 +  + 1.09    0.138            0.076           19           22
    108 +    +  7.32 +  + 0.65    0.258            0.120           30           26
    103 +    +  2.32 +  + 0.20    0.421            0.163           41           35
     98 +    −  2.68 +  − 0.24    0.404            0.175           44           47
     93 +    −  7.68 +  − 0.68    0.246            0.158           40           43
     88 +    − 12.68 +  − 1.12    0.133            0.113           28           27
     83 +    − 17.68 +  − 1.56    0.059            0.074           18           18
     78 +    − 22.68 +  − 2.00    0.023            0.036            9           10
     73 +    − 27.68 +  − 2.44    0.0073           0.0157           4            3
     68 +    − 32.68 +  − 2.88    0.00195          0.0073           2            0

                                                   1.00000        250          250
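
The expected frequencies in the table can be approximated by calculation rather than by reading Fig. 18. The following Python sketch assumes the mean and S.D. of Ex. 17, treats the top class as including everything above it and the bottom class everything below it (as the table does), and uses `statistics.NormalDist` in place of the graph; the function and variable names are illustrative.

```python
from statistics import NormalDist

def expected_class_frequencies(dist, lower_limits, n):
    """Expected F per class for discrete scores.

    lower_limits are the lowest scores of the classes, highest class first.
    The top class takes everything above its limit and the bottom class
    everything below the limit of the class above it.
    """
    below = [dist.cdf(limit) for limit in lower_limits]   # proportion under each limit
    last = len(lower_limits) - 1
    expected = []
    for i in range(len(lower_limits)):
        if i == 0:
            share = 1 - below[0]
        elif i == last:
            share = below[i - 1]
        else:
            share = below[i - 1] - below[i]
        expected.append(n * share)
    return expected

lower_limits = list(range(133, 67, -5))      # 133, 128, ..., 68
fitted = NormalDist(mu=100.68, sigma=11.36)
expected = expected_class_frequencies(fitted, lower_limits, 250)
obtained = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3, 0]

for limit, e, o in zip(lower_limits, expected, obtained):
    print(f"{limit} +  expected {e:6.1f}  obtained {o}")
```

The computed figures differ from the tabulated ones by about a unit in places, since the table works from values of x/σ rounded to two decimals and from graph readings.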


On examining Fig. 18, we see that the proportions whose scores
deviate by 2.5σ or more are very small, less than 1 in 100, and that
those who deviate by 3σ or more only number about 1 in 1,000.
Here, then, we have the reason for a statement made earlier, namely
that the S.D. usually amounts to one-fifth or one-sixth of the total
range of a distribution. For when N is about 100, and the distribution
is a regular one, we shall ordinarily find the measures ranging from
+ 2.5σ to − 2.5σ, i.e. a total of 5σ. And when N approaches 1,000
we may find scores running from + 3σ to − 3σ, a total of 6σ.

There is yet another way of interpreting normal distributions,
which will be useful to us later, that is in terms of probabilities.
Since 0.159 of persons may be expected to obtain measures 1σ or
more above the mean, the probability that any one person will
measure this much or more is 0.159. The probability that he will
measure less is 0.841. In other words, the odds are 841 to 159, or
about 5.3 to 1 against his measure being as high as or higher than
this. Similarly the probability of a measure as high as + 3σ or more
is 0.00135, and the chances against it are 740 to 1. This means, for
example, that we should probably find only one child in an ordinary
elementary school of about 740 pupils with an I.Q. of 145 or more,
if we assume the I.Q.s of school children to be normally distributed
with a σ of 15, since 145 = 100 + 3 × 15. But as many tests have
considerably larger S.D.s, they will yield a greater number of high
I.Q.s, and the probability of coming across such children will be in-
creased. On the other hand the range of intelligence in any one
primary school is often limited, and this would make the odds
against high I.Q.s much stronger. Thus if σ is 12½, then 150 is 4σ
above the mean, and we can deduce from Fig. 18 that only about 1
in 31,000 (in schools with an average I.Q. 100) would reach this level. 
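
These probabilities and odds can be reproduced with a few lines of Python, again using `statistics.NormalDist` as a stand-in for Fig. 18. The function name `odds_against` is an illustrative assumption.

```python
from statistics import NormalDist

def odds_against(z):
    """Odds against a measure lying z standard deviations or more above the mean."""
    p = 1 - NormalDist().cdf(z)      # probability of reaching this level or higher
    return (1 - p) / p

for z in (1, 3, 4):
    print(z, round(1 - NormalDist().cdf(z), 5), round(odds_against(z), 1))
```

For z = 1 the odds come to a little over 5 to 1, for z = 3 to roughly 740 to 1, and for z = 4 to something over 31,000 to 1, in line with the figures quoted.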
INTERPRETATION OF SCHOOL MARKS AND
TEST SCORES 

Combining or Comparing Distributions of Marks. — In Chap. II we
found that the marks awarded to school work or examinations are 
often so erratically distributed that it is impossible to interpret them, 
as they stand, correctly. The first stages in the interpretation of 
marks involve combining and comparing them. For instance, the 
marks for different questions in one examination, or different
examinations in one subject, need to be combined to give a total score.
Often the marks on several subjects are combined to yield a general 
view of the pupil’s or student’s all-round ability. Comparison enters 
when we wish to know whether a pupil or student is better on one 
subject than on some other subject (subjects which may have been 
marked either by the same, or by different, examiners or teachers).
In large-scale examinations where several examiners each deal with a 
proportion of the scripts it is, of course, essential that their markings 
should be comparable. Sometimes, also, age allowances must be 
made; the marks of younger and older pupils need to be rendered 
comparable by the elimination of differences due to age. 

Now, none of these processes can be carried out directly unless 
the marks to be combined or compared occur in closely similar 
frequency distributions. This is a fundamental statistical principle, 
very elementary in nature, yet very widely neglected. But with the 
aid of the statistical tools described in Chap. III, we are now in a
position to determine whether such distributions are sufficiently
similar, and if they are not, to adjust them in such a way as to make 
them so. These tools will further assist in interpretation, since they 
provide a scientific definition of those exceedingly vague conceptions 
— the goodness or badness of a mark or test score, and the ease or 
difficulty of an examination question or test item. 

Distributions may be dissimilar from one another in three respects:
(1) differences in their averages arise, for example, when one exam-
iner is far more lenient than another; (2) differences in their disper-
sions may arise when one examiner spreads out his marks far more
widely above and below the average than another; (3) differences
occur in their shapes, e.g. normal, skew, rectangular, etc. For the
time being we will consider only the most important shape, the
approximately normal. We may then state that marks can be directly 
combined or compared only if drawn from distributions with the 
same mean and the same S.D. 

The Importance of Equivalent Standard Deviations. — Identity of
S.D.s is the more important of these two requirements, though it is 
less often recognized. In many instances of combination of marks, 
the averages do not matter in the slightest. 

Ex. 21. — Suppose that the final marks of a group of pupils or 
students depend on their results in three subjects, A, B, and C, and 
that the distributions are as follows : 

   Subject    Mean    Range     S.D.
      A        60     40-80      7
      B        50     30-70      7
      C        80     60-100     7

Now, a pupil who is average on all three subjects obtains a total 
of 190/300, and another who is average on two subjects but top on 
the third scores 210/300 whatever the subject. 

Ex. 22. — In contrast to Ex. 21, let us now combine the following 
sets of marks, whose S.D.s differ. 


   Subject    Mean    Range     S.D.
      A        60     40-80      7
      B        50     10-90     14
      C        80     70-90      3½

A pupil who is average all round again scores 190. But now a
pupil who is average in two subjects and top in a third obtains 210,
230, and 200 according as the third subject is A, B, or C respectively.
The fact of doing well in subject B (which has the lowest average)
actually raises the score the most because B has the biggest S.D.
And doing well in C has the least effect in raising the final total
(although it has the highest average), because it has the smallest S.D.
Similarly if we compare the performances of three pupils who are
at the lower quartile on one subject but average on the other two,
we find their respective marks to be 185, 181, and 188. Doing badly
on B has a big effect, but doing badly on C has little effect.

The general principle, illustrated by these examples, is that when 
several sets of marks are combined, their relative ‘weights,’ i.e. their 
respective degrees of influence upon the final total, are proportional 
not to their averages but to their S.D.s. If in a final mark we wish to 
give more weight to one subject than to another, e.g. twice as much
weight to written as to mental arithmetic, we shall accomplish it,
not by making the average mark on the former twice the average on 
the latter, but by making the range and S.D. of the former twice as 
large. The usual procedure in such weighting is to multiply the for- 
mer set of marks by two. This may produce the desired effect, not 
because it doubles the average, but because it doubles the S.D. The 
same effect would follow if the average remained unchanged while 
the deviation of each pupil’s mark from this average was doubled. 
When the marks derived from several questions in a single examina- 
tion are to be combined in certain proportions, this same principle of 
adjusting the S.D.s for the questions must be borne in mind. 
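
The totals quoted in Ex. 22 can be recomputed with a small Python sketch. The means, top marks and S.D.s are taken from that example; placing the lower quartile at about 0.67 of a S.D. below the mean assumes roughly normal distributions, and the function names are illustrative.

```python
from statistics import NormalDist

# (mean, top mark, S.D.) for the three subjects of Ex. 22
subjects = {"A": (60, 80, 7.0), "B": (50, 90, 14.0), "C": (80, 90, 3.5)}
average_total = sum(mean for mean, _, _ in subjects.values())

def total_when_top_in(name):
    """Total for a pupil who is average in two subjects and top in `name`."""
    mean, top, _ = subjects[name]
    return average_total - mean + top

def total_when_at_lower_quartile_in(name):
    """Total for a pupil at the lower quartile in `name`, average elsewhere."""
    _, _, sd = subjects[name]
    z_q1 = NormalDist().inv_cdf(0.25)     # about -0.67 (normality assumed)
    return average_total + z_q1 * sd

tops = [total_when_top_in(s) for s in "ABC"]
lows = [round(total_when_at_lower_quartile_in(s)) for s in "ABC"]
print(average_total, tops, lows)
```

In each case the change in the total is governed by the S.D. of the subject concerned, not by its mean.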

In order to combine or compare two sets of marks whose distribu-
tions are dissimilar, we should then express each set in the form of
deviations from their own means, and multiply one set by a figure
which will raise or lower its S.D. to the same level as that of the other
set. For instance, in the two sets B and C in Ex. 22, the ratio of S.D.s
is 14 to 3½, or 4 to 1. The deviations in C will therefore become
equivalent to those in B if multiplied by 4. Alternatively, each mark
may be expressed in x/σ form. The top marks on A, B, and C thus
become $\frac{80 - 60}{7}$, $\frac{90 - 50}{14}$, and $\frac{90 - 80}{3.5}$; these all equal + 2.86.
When so converted, they are all equivalent.

Scores which are expressed in this form are referred to as standard
scores, or σ scores, or sometimes z-scores. As we shall see below,
they are frequently adopted in scoring mental tests. For example, the 
vocational tester may wish to know whether a man’s performance 
on a mechanical test is relatively higher or lower than on a clerical or 
some other test. The tests are likely to have different means and S.D.s, 
but the man’s standard scores can legitimately be compared. 
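
A brief Python sketch of the conversion to standard scores follows, using the top marks, means and S.D.s of Ex. 22. The function name `standard_score` is an illustrative assumption.

```python
def standard_score(mark, mean, sd):
    """Express a mark as its deviation from the mean in S.D. units (x / sigma)."""
    return (mark - mean) / sd

top_a = standard_score(80, 60, 7)
top_b = standard_score(90, 50, 14)
top_c = standard_score(90, 80, 3.5)

# Equivalent route: scale the deviation in C by the ratio of S.D.s (14 / 3.5 = 4)
c_deviation_on_b_scale = (90 - 80) * (14 / 3.5)

print(round(top_a, 2), round(top_b, 2), round(top_c, 2), c_deviation_on_b_scale)
```

The three top marks, so different as raw figures, yield the same standard score, and the rescaled deviation in C equals the deviation of the top mark in B.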

The Importance of Equivalent Means. — Naturally the averages of
distributions are important under some circumstances. If marks on 
several questions or several subjects are to be combined, and alterna- 
tives are allowed, so that each pupil’s final mark is derived from a 
selected number of questions or subjects, then it is quite essential to 
ensure equivalent means. Thus in Exs. 21 and 22, if pupils took either 
A and B, or A and C, those who were at the average level would total 
110 on the former combination, 140 on the latter. Again, when scores
are to be, not combined, but compared, it is obvious that differences 
in means will entirely upset the comparisons. The adjustment of
discrepancies in means is easy: merely add or subtract so much from
all the marks in one subject or question until its average is the same
as that for the other subject or question. More often than not, how-
ever, discrepancies in means are accompanied by discrepancies in
S.D.s. In that case resort must be made to the technique already de-
scribed, according to which each mark is expressed as x/σ, or to some
modification of this technique (cf. pp. 59-62).

EQUATING OF MARKS BY THE PERCENTILE
TECHNIQUE

Although the statistical treatment of distributions in terms of 
means and S.D.s is far from difficult, it may admittedly lie outside 
the scope of the ordinary teacher or examiner. Further, it cannot be
used when the distributions are clearly non-normal. For instance, if a 
headmaster, to whom were submitted the marks listed in Ex. 5 (p. 29) 
wished to compare the pupils’ standings on reading and geography, 
he could not use this technique. The alternative, percentile, system 
would, however, be quite suitable for making such a comparison, 
and it is probably simpler for the uninitiated to comprehend. (It does 
not even involve looking up squares and square roots, as does a 
technique based on the S.D.) 

Percentiles will provide a uniform scale in terms of which compari- 
sons may be made between any sets of marks. Whatever the test or 
examination which a group of pupils or students takes, a certain 
percentile level always means the same level of achievement (relative 
to that group of people). For instance, on looking at Ex. 5 we might 
be inclined to suppose that 8 marks for reading represents a better 
achievement than 3 marks for geography; certainly the pupils them-
selves would infer that this was so. But actually they are precisely
equivalent, since both marks fall at the 25th percentile, Q1. The
teacher may, of course, assert that the general standard of reading in 
that class is good and the geography poor. This is an objection which 
will frequently be raised by critics of this chapter, and we will show 
later why we believe it to be baseless. 

Percentile Graphs. — One of the best ways of comparing two sets of
marks is by a 7-point percentile graph, which is illustrated in the
following example.

                    Frequencies
    Mark        English   Arithmetic
    90 +            1          5
    80 +           10         11
    70 +           15         13
    60 +           17         15
    50 +            8         10
    40 +            5          2
    30 +            1          0
    20 +            0          1

                   57         57

[Fig. 19. — Comparison of two sets of school marks by means of a
seven-point graph.]

Ex. 23. — The table shows the marks of 57 pupils, aged 12 +, in
a Scottish school on final examinations in English and arithmetic.
Though the tabulation is coarse, it indicates that the two distribu-
tions differ in their means and S.D.s, and that, while the first is
roughly normal, the second is negatively skewed. The procedure is,
first, to determine the 99th, 90th, 75th, 50th, 25th, 10th, and 1st
percentiles for each set of marks. From the original lists these are
found to be 90, 84, 77, 67, 59, 49, 39, and 97, 89, 81, 71, 60, 53, 35,
respectively. Draw a graph, as shown in Fig. 19, where English
marks are plotted along the Y axis against arithmetic marks on the
X axis. Plot these seven points (X = 97, Y = 90; X = 89, Y = 84;
etc.), and connect them up by a straight line or smooth curve which
runs as nearly as possible through them all. This line may be ex-
tended at both ends so as to cover the total ranges of marks. From
it we can now read off the mark on the English scale which corre- 
sponds to each mark on the arithmetic scale, and vice versa. Thus 
41 in English is equivalent to 39 in arithmetic, 80 in English to 85 
in arithmetic. 
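
Reading equivalents off the graph can be imitated numerically. The Python sketch below joins the seven percentile points of Ex. 23 by straight segments, which is an assumption made for simplicity in place of the freehand smooth curve, and extends the end segments to cover the full range; the function name `read_off` is illustrative.

```python
# Seven percentile points (1st ... 99th) from Ex. 23
arithmetic = [35, 53, 60, 71, 81, 89, 97]
english = [39, 49, 59, 67, 77, 84, 90]

def read_off(mark, from_scale, to_scale):
    """Convert a mark from one scale to the other by straight-line segments.

    Marks beyond the outermost points are handled by extending the first
    or last segment, as when the line is extended at both ends.
    """
    last_segment = len(from_scale) - 2
    for i in range(len(from_scale) - 1):
        if mark <= from_scale[i + 1] or i == last_segment:
            x0, x1 = from_scale[i], from_scale[i + 1]
            y0, y1 = to_scale[i], to_scale[i + 1]
            return y0 + (mark - x0) * (y1 - y0) / (x1 - x0)

english_41 = read_off(41, english, arithmetic)
english_80 = read_off(80, english, arithmetic)
print(round(english_41, 1), round(english_80, 1))
```

The results fall within about a mark of the equivalents read from the smooth curve (39 and 85); the small difference arises from using straight segments.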

Often it is not necessary to determine as many points as seven in
order to fix the graph. Three points, preferably the 90th, 50th, and
10th percentiles, may give sufficient accuracy. Note that just the
same procedure could have been used even if the distributions had
differed widely in range and dispersion, e.g. if one subject had been
marked out of 40, the other out of 200.¹

Interpretation of Marks by Means of Percentile Levels. — The
results of many standardized tests of intelligence and vocational
aptitudes are quoted in terms of percentiles, since the original or
raw scores have little meaning in themselves; for example, percen-
tiles relative to 13-year modern pupils, or percentiles relative to R.A.F.
recruits, and so on. It has sometimes been suggested that the same
system should be used for school or university examinations. The
original marks would not be published at all. Average achievement
would always be represented by 50, and very high or low achievement
by marks approaching 100 and 0. Though this would be fairer in some
ways, it would be difficult for pupils, students, or parents to grasp
that these were always relative to the group which had taken that
examination. Percentiles are, in fact, precisely equivalent to rank
orders, except that they always run from 100 to 0 instead of from 1
to N. The scheme would also not be feasible for examinations taken
by small numbers, say fewer than twenty students (cf. p. 66). Never-
theless it would be a considerable step forward if the mark lists for
large-scale examinations, at universities, training centres or second-
ary schools, included a table of the median, quartiles and some of
the other percentiles for the marks which had actually been awarded.
Teachers and students would soon learn the significance of these
terms, and would then be able to see what standards of marking the
various examiners were adopting. They could also obtain a much
better idea of relative marks in different subjects than from the pre-
sent supposedly absolute marks, which actually vary widely from
one examination to another.


¹ Further discussion of this topic may be found in McIntosh et al. (1949).

Defects of Percentiles. — Although percentiles are invaluable for
purposes of comparing distributions, they may be very misleading if 
used for combining distributions. The reason is that perceAtile units 
are very far from equivalent to one another, since they are far closer 
together at or near the median than they are at or near the extremes. 
It can be seen from the probability ‘integral’ graph (Fig. 18) that a 
percentile level of 90, i.e. the measure above which only 10% of the 

X 

measures lie, corresponds to an - or standard score of + 1-28. The 

a 

80th similarly corresponds to + 0*84, the 60th and 50th to + 0*25 
and 0-00. Thus the difference between the first two of these, 0-44, is 
decidedly greater than that between the second pair, 0-25. Actually 
the 99th, 94th, 78th, 50th, 22nd, 6th, and 1 st percentiles are all approxi- 
mately equally spaced, in the sense that a child who is at the 99th 
percentile is as superior to one at the 94th, as the 94th is to tlfe 78th, 
or the 78th to the 50th. The consequence of this inequality of units 
is that, although we may, if we wish, average or otherwise combine 
a person’s percentiles on different tests, we shall obtain thereby quite 
different results from those obtained by averaging his scores. 
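
The unequal spacing of percentile units can be checked numerically. The following is only a sketch: it assumes an exactly normal distribution and uses Python's standard-library `NormalDist` in place of the probability integral graph; the function name `percentile_to_sigma` is ours, not the author's.

```python
from statistics import NormalDist

def percentile_to_sigma(percentile: float) -> float:
    """x/sigma score below which `percentile` per cent of a normal distribution lies."""
    return NormalDist().inv_cdf(percentile / 100.0)

# The seven levels which the text describes as roughly equally spaced.
levels = [99, 94, 78, 50, 22, 6, 1]
scores = [percentile_to_sigma(p) for p in levels]

# Gap in x/sigma units between each level and the next one down.
gaps = [round(a - b, 2) for a, b in zip(scores, scores[1:])]

for p, s in zip(levels, scores):
    print(f"percentile {p:>2} -> x/sigma {s:+.2f}")
print("gaps between successive levels:", gaps)
```

Each gap comes out at a little over three-quarters of a standard deviation, although the percentile differences themselves range from 5 to 28.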

Ex. 24. — Pupil A’s marks on two examinations lie at the 98th and 
54th percentiles; Pupil B’s marks on both examinations lie at the 
85th percentile. A’s average percentile of 76 is distinctly poorer than 
B’s 85. But if these figures are converted into x/σ scores, i.e. into units 
which really are equivalent, it will be found that A is actually slightly 
better than B. For the x/σ values of the 98th and 54th percentiles are 
+2.054 and +0.100, average +1.077; whereas the x/σ value of the 
85th percentile is only +1.036. 

It is advisable, therefore, never to use percentiles for combining 
marks. The same is true of rank orders (cf. p. 25-6). 
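
The reversal in Ex. 24 can be reproduced in a few lines. This is a sketch under the assumption of normally distributed marks; the helper `sigma_score` and the variable names are illustrative only.

```python
from statistics import NormalDist, mean

def sigma_score(percentile: float) -> float:
    """Convert a percentile level into an x/sigma score on a normal distribution."""
    return NormalDist().inv_cdf(percentile / 100.0)

pupil_a = [98, 54]   # percentile levels on the two examinations
pupil_b = [85, 85]

# Averaging the percentiles directly puts B ahead of A.
avg_percentile_a = mean(pupil_a)
avg_percentile_b = mean(pupil_b)

# Averaging the equivalent x/sigma scores puts A slightly ahead of B.
avg_sigma_a = mean(sigma_score(p) for p in pupil_a)
avg_sigma_b = mean(sigma_score(p) for p in pupil_b)

print(avg_percentile_a, avg_percentile_b)
print(round(avg_sigma_a, 3), round(avg_sigma_b, 3))
```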

Adoption of a Standard Scale for All School or Examination Marks. — 
The following solution to this difficulty may be recommended. 
Each chief examiner, headmaster, or other authority who is concerned 
with combining, comparing, and interpreting sets of marks 
awarded by sub-examiners, class teachers, etc., should adopt an 
arbitrary standard scale. He might, for instance, choose a percentage 
scale ranging roughly from 30% to 90%, with a mean of 60% 
and a S.D. of 10. Now, on such a scale the 7 percentile points 
mentioned above fall at 83, 73, 67, 60, 53, 47, and 37 respectively. Each 
set of marks submitted by a teacher or examiner should be converted 
into this scale by the graphical method already described. If this is 
done, all the marks from all the markers will be strictly comparable 
with one another, and they will be expressed in units which are 
(sufficiently closely for all practical purposes) equivalent to one 
another, so that they can be combined or averaged at will. And there is 
the further advantage that the scale corresponds well to the present 
conventional conception of high and low percentages. Naturally any 
other standard scale can be chosen if preferred, and its percentile 
points determined from the probability integral graph.¹ Fig. 20 shows 
the graphs for transforming the two distributions of Ex. 23 into the 
suggested scale. 97 for arithmetic and 90 for English are plotted 
against 83 (the 99th percentile point) on the standard scale, and so 
on. The standard mark for any other mark can be read off. Thus 81% 
in arithmetic becomes 67% standard, and 81% in English becomes 
71% standard. 

Fig. 20. — Conversion of two sets of school marks into a standard scale. 
(Axes: English and arithmetic marks, %, against the standard scale.) 

¹ The x/σ values of the 99th and 1st percentiles are + and - 2.326 respectively, 
of the 90th and 10th they are + and - 1.282, of the 75th and 25th they are + 
and - 0.675. 
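
The graphical conversion can be imitated by straight-line interpolation between the percentile points. What follows is a sketch, not the author's procedure: the raw arithmetic marks are hypothetical apart from 97 and 81, which are placed at the 99th and 75th percentile points as the figures quoted in the text imply, and marks beyond the end points are simply clamped.

```python
from statistics import NormalDist

# The seven percentile points used in the text, lowest first.
PERCENTILES = [1, 10, 25, 50, 75, 90, 99]

def standard_points(mean=60.0, sd=10.0):
    """Marks on the standard scale that fall at the seven percentile points."""
    unit_normal = NormalDist()
    return [mean + sd * unit_normal.inv_cdf(p / 100) for p in PERCENTILES]

def to_standard(raw_mark, raw_points, std_points):
    """Read a standard mark off the conversion graph by linear interpolation.

    raw_points must be strictly increasing. Marks beyond the end points are
    clamped to the ends of the scale (an assumption of this sketch).
    """
    if raw_mark <= raw_points[0]:
        return std_points[0]
    if raw_mark >= raw_points[-1]:
        return std_points[-1]
    for i in range(len(raw_points) - 1):
        r0, r1 = raw_points[i], raw_points[i + 1]
        if r0 <= raw_mark <= r1:
            s0, s1 = std_points[i], std_points[i + 1]
            return s0 + (raw_mark - r0) * (s1 - s0) / (r1 - r0)
    raise ValueError("raw_points must be strictly increasing")

# Hypothetical raw arithmetic marks at the seven percentile points.
arithmetic_points = [20, 38, 52, 66, 81, 90, 97]
scale = standard_points()

print([round(s) for s in scale])
print(round(to_standard(81, arithmetic_points, scale)))
```

With the default mean of 60 and S.D. of 10 the rounded scale points are 37, 47, 53, 60, 67, 73 and 83, which are the figures given in the text.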

Another way of achieving the same result is as follows. By apply- 
ing the method described in Ex. 20 (p. 50), a table may be drawn up 
showing the numbers of cases to be expected within each class- 
interval on the desired standard scale. Thus for a scale with a mean 
of 60 and S.D. 10, the following percentages of pupils should fall 
within successive class-intervals of 5 marks. 


Ex. 25. — 

Mark      F% 

88-92     0 or 1 
83-87     1 
78-82     3 
73-77     6½ 
68-72     12 
63-67     17 
58-62     20 
53-57     17 
48-52     12 
43-47     6½ 
38-42     3 
33-37     1 
28-32     0 or 1 

          100 
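
A table of this kind can be generated for any chosen standard scale. The sketch below assumes a normal distribution and treats each class-interval of whole marks as extending half a mark either way (58-62 as 57.5 to 62.5); the function name and its defaults are ours.

```python
from statistics import NormalDist

def expected_percentages(mean=60.0, sd=10.0, width=5, lowest=28, highest=92):
    """Expected percentage of pupils in each class-interval, highest interval first."""
    dist = NormalDist(mean, sd)
    rows = []
    low = lowest
    while low + width - 1 <= highest:
        high = low + width - 1
        # An interval of whole marks low..high covers low - 0.5 to high + 0.5.
        pct = 100 * (dist.cdf(high + 0.5) - dist.cdf(low - 0.5))
        rows.append((low, high, pct))
        low += width
    return rows[::-1]

table = expected_percentages()
for low, high, pct in table:
    print(f"{low}-{high}  {pct:5.1f}")
```

The computed values round to the frequencies shown above; the two extreme intervals come out at about a quarter of one per cent each, hence ‘0 or 1’.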


Teachers or sub-examiners may be given such a table and instructed 
to award marks to their pupils roughly in accordance with 
the frequencies there shown (precise conformity is not necessary). 
Or, alternatively, the chief examiner or head teacher may take each of 
the distributions of unscaled marks, award a scaled mark between 
88 and 92 to the top 0 or 1% of the examinees, marks between 83 and 
87 to the next 1% of them, and so on. Only very exceptionally good 
or poor candidates — those better or worse than 999 in 1,000 of the 
ordinary run — should be given marks of over 90% or under 30%. 
For in a normal distribution only a proportion of 0.135% is expected 
to have an x/σ score of more than +3, or of less than -3. 
Burt (1917) recommends an alternative standard scale of marks, 
based on a similar plan, but with an average of 10 and a maximum 
range of 0-20. The scale is given here with the frequencies to be 
expected when (a) 100 and (b) 40 pupils are marked according to it. 
Here also exact adherence should not be demanded. Any one set of 
marks for 40 pupils may depart from it fairly widely. But in the long 
run all the marks awarded by any one teacher or examiner should 
show the same mean, S.D., and tendency to normality as does this 
standard scale. 

Ex. 26. — 

Mark       Frequency 
           (a)    (b) 

20-16      Exceptional cases only 
15-14        4      1 
13           6      3 
12          12      5 
11          18      7 
10          20      8 
 9          18      7 
 8          12      5 
 7           6      3 
 6-5         4      1 
 4-0       Exceptional cases only 

           100     40 

OBJECTIONS TO RATIONALIZED MARKING 

There is no doubt that many teachers and examiners will object 
strongly to the principle upon which this chapter is based, that the 
average and the dispersion of sets of marks should always be the 
same in any one institution. Faculties in a university which 
consistently award high marks sometimes try to justify themselves by 
claiming that they get a ‘better run’ of students than do those which 
consistently award low marks. Examiners regard it as only natural to 
award different average marks to the answers to different questions 
in one paper, since some questions are ‘done better’ than others. 
Teachers similarly regard their classes as ‘better at’ some subjects 
than at others, and so deserving of higher marks. Again, the class 
they have this year may seem to them an exceptionally ‘good’ or 
‘bad’ one, and the compositions written by the class may be ‘better 
or poorer than usual.’ Finally, those mathematical lecturers and 
teachers who are particularly prone to award very widely dispersed 
marks state that their students exhibit a ‘wider range of 
accomplishment’ than do students of other subjects. 

Justification for Equalizing the Average in All Sets of Marks. — 
Many of these claims are extremely dubious. Thus it is more likely 
that the university departments which award high marks will attract 
a poorer, rather than a better, run of students, and that the subjects 
which are marked more strictly will be pursued chiefly by the better 
students who care more for the subjects than for marks. Again, one 
is justified in suspecting teachers’ judgments about the merits of their 
classes, since some teachers appear, year after year, to be blessed 
with good classes, and others to be cursed with bad ones. According 
to what is popularly known as ‘the law of averages,’ one would 
expect the number of good pupils to be balanced in the long run by 
an equal number of bad ones. As a matter of fact, the statistical 
techniques outlined in Chap. V give us a method of predicting just 
how much classes are likely to vary from year to year, or English 
compositions from week to week, owing to so-called fluctuations of 
sampling. Here is one instance. A class of 45 pupils averages 60% in 
their general scholastic work on a scale of marks running roughly 
from 30% to 90%. From these data it may be stated that there is an 
even chance that the average level of next year’s class will lie between 
59% and 61 %, and that it is extremely improbable that next year’s 
will differ from this year’s by more than 3% or 4% up or down. 
Consider also the marking of English compositions, or of half a 
dozen or more separate questions in an examination paper. If the 
scale of marks employed ranges roughly from 4 to 20 (S.D. 3), then 
we can predict that the maximum variation likely to occur in the 
average marks for different compositions or questions is 3 marks, 
provided that there are at least 36 scripts, and only 2 marks if there 
are 81 scripts. An individual pupil may certainly range very widely in 
his level of performance on different questions, but the average level 
of a group of pupils is far more stable than is commonly supposed. 
The apparent variations in the goodness of answers are due much 
less to variations in the pupils than to irregularities in the suitability 
of the questions or of the topics for compositions. But if an examiner 
sets a bad question which is beyond the scope of the majority of 
examinees, or which does not allow them to express their fullest 
capacities, there is no justification for giving low marks. It is there- 
fore much fairer to give the same average mark to the actual average 
script in all cases, as was suggested in Chap. II. 
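
The figures quoted above for fluctuations of sampling can be recovered from the standard error of a mean. This sketch rests on our own reading of the text: the S.D. of 10 is inferred from the 30% to 90% scale, the ‘even chance’ band is taken as the mean plus or minus one probable error (0.6745 standard errors), and the ‘maximum variation likely’ as a spread of three standard errors either side. None of the function names comes from the book.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_mean(sd: float, n: int) -> float:
    """Standard error of the average of n marks drawn from a distribution with the given S.D."""
    return sd / sqrt(n)

def even_chance_band(mean: float, sd: float, n: int):
    """Band with an even chance of containing the average of another group of n."""
    probable_error = NormalDist().inv_cdf(0.75) * standard_error_of_mean(sd, n)
    return mean - probable_error, mean + probable_error

def likely_spread_of_averages(sd: float, n: int) -> float:
    """Total spread (three standard errors either side) of group averages."""
    return 6 * standard_error_of_mean(sd, n)

low, high = even_chance_band(60, 10, 45)
print(round(low, 1), round(high, 1))       # class of 45 averaging 60%
print(likely_spread_of_averages(3, 36))    # 36 scripts, S.D. 3
print(likely_spread_of_averages(3, 81))    # 81 scripts, S.D. 3
```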

It is a different matter when a group of pupils or students is known, 
from objective evidence, to be selected in such a way as to be, on the 
average, superior or inferior to the mean of some larger group. A 
special class for scholarship candidates and one for backward pupils 
will naturally show generally good and poor attainments respectively. 
But in many cases the teacher or examiner who claims to have a 
specially good or poor set of pupils or scripts is judging purely on 
the basis of subjective recollections of what he regards as average 
level. When a test or examination is applied to all the pupils of a 
certain age in a school, or in several schools, then the whole group 
should be marked according to the principles already outlined, and 
the marks for separate classes within this wider group may be picked 
out and averaged. Such averages will then quite justifiably show a 
certain range of variation. But when a test or examination is given 
only to a single class, it is far safer not to trust the marker’s subjective 
view of the class’s general level, but to award an average mark to the 
average script actually obtained from this class. Similarly the aver- 
ages for different school subjects should be the same, regardless of 
whether a class appears to be better at one subject than at another, 
except when objective evidence (derived from the application of the 
same tests or examinations to other, larger, groups of pupils) proves 
that the class’s attainments are uneven. 

The ‘Goodness’ or ‘Badness’ of a Mark. — We still have to face the 
wider question : what do teachers and examiners mean when they say 
that a pupil’s work is good or bad, or an examination paper easy or 
difficult? For endless objections of the type discussed above will be 
raised until this is settled. Most persons who have marking to do 
obviously believe that they possess in their minds some absolute 
standard of goodness and badness by means of which they can 
evaluate a pupil’s or a class’s achievement. But numerous experi- 
ments have proved that such subjective judgments of merit or diffi- 
culty, when obtained from several different persons, are often widely 
discrepant. It was for this reason that we were forced to conclude, in 
Chap. II, that all marking consists primarily in ranking a set of 
performances in order, and that it is a matter of relative comparison, 
not of absolute placement. The goodness of a child’s performance at 
some task can therefore mean only that he ranks high when his 
performance is directly compared with the performances of other 
children at the same task, under similar circumstances. An achieve- 
ment is good if only very few people can do as well or better, and it 
is poor if only very few people do it as badly or worse. Similarly, the 
rational definition of the difficulty of a task is the fewness of people 
who can accomplish this task. 

‘Fewness of people,’ in statistical terminology, connotes either 
percentile level or, preferably, x/σ value. A mark is not necessarily a 
good one because it happens to be 80% or 9/10 and the like. But it is 
good relative to the marks of some particular group of persons if it 
falls at the 95th percentile, or is 2σ above the mean mark obtained 
by these persons, or if it stands high on a standard scale whose mean 
and S.D. are known. Such designations of goodness or badness are 
entirely unequivocal. 

The Ease or Difficulty of Questions. — By approaching marking 
from this standpoint we obtain a rational scale, not only of the 
goodness or badness of pupils’ work, but also of the ease or difficulty of 
questions or test items. If, for example, five items are answered 
correctly by 16%, 22½%, 31%, 40%, and 50% of a large group of 
persons, then we can state that (for these persons) the items are all 
equally spaced in difficulty. For these proportions correspond to x/σ 
values of +1.0, +0.75, +0.50, +0.25, and 0.0 respectively. 
Similarly some questions which are below average difficulty will be 
equally spaced if answered by, say, 91%, 94%, and 96% of pupils, 
for here the proportions of incorrect answers, namely 9%, 6%, and 
4%, correspond to x/σ values of -1.35, -1.55, and -1.75 
respectively. 
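
The same conversion gives a difficulty value for any item once the proportion of correct answers is known. A minimal sketch, assuming that the ability underlying each item is normally distributed in the group tested; `item_difficulty` is an illustrative name.

```python
from statistics import NormalDist

def item_difficulty(percent_correct: float) -> float:
    """x/sigma difficulty of an item; positive values mean fewer than half succeed."""
    return NormalDist().inv_cdf(1 - percent_correct / 100.0)

hard_items = [16, 22.5, 31, 40, 50]   # percentages answering correctly
easy_items = [91, 94, 96]

print([round(item_difficulty(p), 2) for p in hard_items])
print([round(item_difficulty(p), 2) for p in easy_items])
```

The computed values agree with the rounded figures in the text to within about 0.01.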

Justification for Equalizing the Dispersion in All Sets of Marks. — 
The claim that mathematical accomplishments vary more widely 
than do accomplishments in other subjects requires some further 
consideration. It is true that it is usually easier to discriminate 
between different levels of mathematical than of, say, English 
composition, ability, because tests in the former can generally be marked 
more objectively and accurately. Thus in the work of an ordinary 
school class it may be impossible to distinguish more than about ten 
different levels of essay writing, whereas perhaps a hundred levels of 
mathematical skill may be distinguished. Although there is some 
justification for attaching greater weight to marks obtained by the 
more efficient measuring instrument, yet this greater efficiency is no 
criterion of the range of ability, as the following analogy shows. In 
a school quarter-mile race, one referee may time the boys with a 
seconds’ hand watch, and find that some take 50, others 51, 52, . . . 
60 seconds. Another referee may possess a tenth-of-a-second 
stop-watch and find that some boys take 50.0, others 50.1, 50.2 . . . 59.9, 
60.0 seconds. But the range of ability is the same in both cases. 

It is only fair to mention that some authorities such as G. H. 
Thomson (1939) have suggested that greater variability would be 
found in mathematical than in other abilities, if it were possible to 
measure variability absolutely. But there seems to be no satisfactory 
or convincing way of achieving such absolute measurement in 
psychology. Perhaps it is wisest, therefore, to follow the convention 
that mental testers have adopted of assuming the same dispersion 
for all abilities in representative groups of all ages. 

The Marking of Small Groups. — Peculiar difficulties still remain in 
applying the principles and techniques of this chapter to the marking 
of small groups. The mean and the S.D. of marks among, say, half 
a dozen Honours students may vary far more from year to year than 
they do among 40 or 100 students. If the average is taken as 60% 
this year, the odds are even that it will lie somewhere between 57 and 
63 next year, but it may drop as low as about 50%, or rise as high as 
70%. And if there is only one candidate who is awarded 60% this 
year, a single candidate next year may deserve anywhere from 30% 
to 90%. Until such time as objective tests are available, standardized 
on Honours students from many universities, or over several years, 
the examiner can only employ subjective marking scales, as indicated 
in Chap. II (p. 35). 


TEST NORMS 

Test norms are tables of the scores of large groups of children or 
adults by reference to which it is possible to interpret the significance 
of the score obtained by any person, or group of persons, to whom 
the test is applied. The mental tester fully admits that a test perfor- 
mance, as such, tells us nothing as to its goodness or badness, and 
that it must be compared with the performances of other similar 
persons tested under the same conditions. Thus some tests are 
standardized, and norms are obtained, by being applied to large groups 
of typical 11-year junior school children; others are given to 
school-leavers aged 14 to 15, or to grammar school pupils; others to adult 
Army recruits, or to university students, and so forth. American 
testers often publish grade norms, i.e. the standards obtained by 
successive school classes in their elementary and high schools. Here 
too a number of schools representative of all levels must be chosen, 
on account of the considerable variations in standards between 
different schools. 

A mere statement of the average or median scores of such groups 
is, of course, insufficient. Often the deciles are quoted, so that it is 
possible to state that a person’s performance is, say, as good as that 
of the best 20% of the group upon which the test was standardized. 
A system based on x/σ values, or standard scores, is preferable since, 
as shown above, percentile units vary widely in magnitude at different 
parts of the scale. A system which is becoming widely adopted is 
first to determine the percentiles and then to convert these into 
standard scores with an arbitrary S.D. (usually 10, 15 or 20) and an 
arbitrary mean (usually 50 or 100). By this device the converted 
scores may be regarded as an equal-unit scale, even though the 
original distribution of raw scores was markedly non-normal. Thus 
with a S.D. of 15 and a mean of 100, whatever raw score falls at the 
84th percentile is converted to 115. The range of standard scores on 
this scale would be from about 145 to 55. 
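
The two-step conversion just described (raw score to percentile, percentile to standard score) might be sketched as follows. The mid-rank rule used for the percentile of tied scores is our own assumption, chosen so that no score falls at exactly 0 or 100, and the raw scores are invented to give a markedly skewed distribution.

```python
from statistics import NormalDist

def standard_score_at_percentile(percentile, mean=100.0, sd=15.0):
    """Standard score corresponding to a percentile level on a normal scale."""
    return mean + sd * NormalDist().inv_cdf(percentile / 100.0)

def normalised_scores(raw_scores, mean=100.0, sd=15.0):
    """Map each distinct raw score to a standard score via its percentile level."""
    n = len(raw_scores)
    table = {}
    for score in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < score)
        tied = sum(1 for s in raw_scores if s == score)
        # Mid-rank percentile: half of the tied cases are counted as below.
        percentile = 100 * (below + 0.5 * tied) / n
        table[score] = standard_score_at_percentile(percentile, mean, sd)
    return table

raw = [1, 1, 1, 2, 2, 3, 5, 9, 20, 40]   # invented, markedly skewed
for score, converted in normalised_scores(raw).items():
    print(score, round(converted))

print(round(standard_score_at_percentile(84)))
```

However uneven the raw scores, the converted scores keep their rank order and are spaced according to percentile level alone.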

It should by now be obvious that the usefulness of such norms is 
strictly limited by the nature of the standardization group. Norms 
are purely and simply a summary of the test performances of a 
particular group of testees, and must be interpreted with reference to 
that group. For example, if an American educational test is given 
to a British child and his score falls at the 70th percentile according 
to the norms published with the test, this means that he did better 
than 70% of the American children who took the test, but does not 
necessarily show that he is above the British median or average. In 
point of fact American educational tests are almost useless in this 
country because their standards differ so greatly from ours (cf. 
McGregor, 1934), and, of course, because their content and wording 
are often unsuitable. Probably the same is true of group verbal 
intelligence tests, though non-verbal tests and individual scales such 
as Terman-Merrill and Wechsler-Bellevue (with minor adaptations) 
seem to give comparable results (cf. Chap. IX). Even between Eng- 
land and Scotland the differences are so big that few tests which 
have been standardized in one country can be used in the other, 
unless fresh norms have been collected (cf. Vernon, O’Gorman and 
McLellan, 1955). The appalling efficiency of Scottish teaching methods 
raises the performance of Scottish elementary school children on tests 
of the 3 R’s by anything from 10% to 50% over the English level. Such 
evidence as we possess, however, points to no appreciable differences 
on tests of general intelligence. 

Many tests have been adequately standardized in London or 
Glasgow, but are not appropriate for use in the provinces or in rural 
areas where the educational level, and possibly also the intelligence 
level, may be different. Probably it would be better to provide more 
than one set of norms for such different types of school, so that a 
child’s test performance could be referred to the most suitable set. 
Apparently even the size of school has some effect upon educational 
level (cf. Mowat, 1938). Still more potent is socio-economic status, 
since this is known to have a marked effect upon educational and 
intellectual levels.¹ Thus if a test has been standardized by being 
applied only to children in a rather poor area, its norms will not be 
appropriate for schools in a relatively superior area. Adequate 
norms for, say, children aged 11+ would be obtained only if all 
types of schools containing pupils of this age were adequately 
represented in the standardization group. Norms derived only from 
one type of school may still be of some use, but the tester must then 
remember to interpret his testees’ results with reference to this type. 

The origin and the adequacy of many of the published test norms 
are so dubious that we would advise testers in schools to treat them 
all with caution, and when possible to do without them. Once a whole 
school class, or all the children of a certain age in the school, have 
been tested, each child’s standing relative to the rest of this group 
will be known, and will provide most of the information needed for 
educational diagnosis and prognosis, without reference to any outside 
standards. And if the same tests can be applied in a school for 
several years, an extremely useful set of ‘local’ norms could quite 
easily be collected, either in terms of percentiles or of σ scores. 

¹ Cf. Burt (1937). The reasons for this effect are probably partly environmental 
and partly hereditary. Poor surroundings and upbringing undoubtedly have 
many adverse influences upon education. But in addition there is indisputable 
evidence of a connection, albeit a small one, between intelligence and social class, 
quite apart from environmental influences. 

AGE NORMS 

Age norms may be defined as follows. A child who obtains a 
Mental Age or M.A. (on an intelligence test), or an Educational Age 
or E.A. (on an educational test) of x years is one whose performance 
is as good as that of the average child of Chronological Age or C.A. 
of x years. For example, a child aged 10.0 may do as well in an 
intelligence test as an average 11.0-year-old, but only score up to the 
level of a 9.0-year-old on tests of reading and arithmetic. He is then 
said to possess C.A. 10.0, M.A. 11.0, Reading and Arithmetic Ages 
of 9.0. At first sight this system of expressing test results is far simpler 
than the percentile or x/σ systems. It provides a uniform scale of units 
by means of which a child’s relative performances on any number of 
tests can be compared. His strong or weak points in different school 
subjects, or even in particular aspects of school subjects (e.g. the 
main arithmetical skills), and his superiority or inferiority on verbal 
and non-verbal intelligence tests, etc., can be seen at a glance from 
a table or graph of his various E.A.s and M.A.s. Nevertheless, there 
are many serious weaknesses in this type of norm, which we shall 
have to describe in detail. 

Difficulties in Establishing Age Norms. — First there are almost 
insuperable difficulties in collecting standardization groups which 
will be truly representative of all the children of a certain age. 
Between 6 and 11 years the primary schools will provide groups whose 
average scores are probably close to the averages in the total 
population. Any one age group will, of course, be scattered through a large 
number of different school classes. But the range of ability in primary 
schools is restricted, since some of the brightest and dullest sections 
of the population attend private and special schools respectively. 
And, as we have seen, the range or dispersion of a distribution is 
quite as important as its average. Beyond 11 to 12 years, when 
primary school pupils proceed to various types of schools, it becomes 
increasingly difficult to secure samples whose averages, let alone 
ranges, are likely to be typical. In a very few instances results have 
been obtained from practically all the children in a given district,¹ 
or from all the children born on certain dates.² But often the best 
that can be done is to control the socio-economic factor. Suppose 
that a district contains a hundred schools, and that a tester wishes 
to standardize a test on pupils in ten of them; he should see that the 
schools he selects cover the whole range of socio-economic level, in 
due proportion. An extension of this method was used by Cattell 
(1934) in establishing adult norms for an intelligence test. His testees 
were classified according to their occupations, and the numbers at 
each main occupational level were compared with the corresponding 
numbers in the total population. Since he had an unduly high 
proportion of professionals, and too low a proportion of labourers, he 
reduced the weight of the former and increased that of the latter 
group, before calculating the mean and the S.D. of test scores for his 
combined groups. 

¹ Cf. Fraser Roberts et al. (1935). 
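
Reweighting of this kind can be sketched as below. The text does not give Cattell's arithmetic, so this is only one common way of doing it: every testee in a group receives the weight (population share of the group) divided by (number tested in the group). The scores and shares are invented for illustration.

```python
from math import sqrt

def reweighted_mean_sd(groups):
    """Mean and S.D. of scores after weighting each group to its population share.

    `groups` is a list of (population_share, scores) pairs.
    """
    weights, values = [], []
    for population_share, scores in groups:
        per_person = population_share / len(scores)
        weights += [per_person] * len(scores)
        values += list(scores)
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    return mean, sqrt(variance)

# Invented sample: the higher-scoring group is over-represented (two testees
# out of three) although it forms only a quarter of the population.
sample = [
    (0.25, [80, 80]),
    (0.75, [40]),
]
mean, sd = reweighted_mean_sd(sample)
print(round(mean, 1), round(sd, 1))
```

The unweighted mean of the three invented scores is about 67, whereas the reweighted mean is 50, showing how an unrepresentative sample would otherwise inflate the norm.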

The same difficulty in obtaining representative samples hinders us 
from tracing accurately the course of intellectual and educational 
growth from childhood, through adolescence, to adulthood. The 
available evidence, however, shows fairly definitely that this growth 
is not entirely regular, which means that M.A. and E.A. units are not 
equivalent throughout. The year by year increase of intelligence in 
an average person seems to be reasonably constant from about 3 to 
10 years, after which the rate of increase diminishes, and the M.A. 
units become progressively smaller, until a constant level is reached 
somewhere in the neighbourhood of 15 years. Probably this upper 
limit is reached at different ages on different tests, and by different 
individuals. When the growth curves of particular children have been 
followed through, they have shown considerable irregularities (Dear- 
born and Rothney, 1941). Average adolescents may show no further 
improvement after about 12 on a simple performance test but, pro- 
vided their education continues, they may go on advancing on 
complex verbal tests even up to 18 or 20. The upper limit for a very 
dull child is, of course, far lower; he may never surpass the per- 
formance of an average 10-year-old on any test. It is not yet certain 
whether he reaches this limit at the same C.A. as an average child 
reaches his, or somewhat earlier. A very bright child, on the other 
hand, naturally improves well above the normal 15-year level, and 
so we express his intelligence as M.A. 16, 17, . . . up to 22 or even 
higher. But these units are entirely artificial. A 20-year M.A. does 
not mean that a testee’s performance is equal to that of an average 
person aged 20, but only that his performance is a good deal better 
than the performance of average persons aged anything from 15 
years upwards. We must conclude, then, that M.A. units are so 
complex, and likely to be so uneven above about 12 years, that it would 
be far better to discard them altogether at such levels. 

² Cf. Scottish Council for Research in Education (1933). 

Lack of Equivalence of M.A. and E.A. Units. — Still less is known 
about E.A. units. Possibly the increase is fairly steady from about 
6 to 10 or 11 years, but then there is a strong tendency in most 
schools to force children unduly, in order that they may pass the 
secondary school selection examinations. Thereafter some relaxa- 
tion occurs, at least on the formal subjects. Hence the 10- to 11-year 
unit is likely to be exceptionally large and the 12- to 13-year unit 
exceptionally small. The units may then tail off among those in 
modern schools, but among grammar school pupils and university 
students there is likely to be progress well into adulthood. Clearly 
then E.A. and M.A. units do not provide comparable scales much 
above 10 years. A child of 13 with an M.A. and an E.A. of 14 is not 
equally superior in intelligence and in education, since 14 minus 13 
is a bigger quantity in the latter than in the former. Nor, of course, is 
this superiority of 1 year equivalent to the superiority of a child of 7 
whose M.A. and E.A. are 8 years. 

Not long after Binet put forward the M.A. system of scaling in- 
telligence, the discovery was made that a child’s relative advance- 
ment or retardation do not remain constant, but increase as his age 
increases. Burt (1917) proved that the same was true of E.A. A child 
of 5 with an M.A. or E.A. of 6 will usually show an M.A. or E.A. of 
12, not of 11, when he reaches C.A. 10. In other words, a superiority 
of a year in a young child is much greater than the same superiority 
in an older child. But the ratio of M.A. to C.A., or E.A. to C.A., 
does seem to remain approximately constant, hence the I.Q. and 
E.Q. have been very widely adopted as measures of intellectual and 
educational advancement and retardation. 
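
The quotient and the constant-ratio assumption behind it amount to very little arithmetic. A sketch, in which `quotient` and `projected_test_age` are illustrative names and the constancy of the ratio is the approximation stated in the text, not a guarantee:

```python
def quotient(test_age: float, chronological_age: float) -> float:
    """I.Q. or E.Q.: test age as a percentage of chronological age."""
    return 100.0 * test_age / chronological_age

def projected_test_age(test_age_now: float, age_now: float, age_later: float) -> float:
    """Test age expected at a later age if the ratio stays constant."""
    return test_age_now / age_now * age_later

print(quotient(6, 5))                 # child of 5 with M.A. 6
print(projected_test_age(6, 5, 10))   # the same child at C.A. 10
```

The child who is one year ahead at 5 is thus expected to be two years ahead at 10, while his quotient remains 120.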

Ambiguity in I.Q. Units. — There is yet another flaw both in M.A. 
and I.Q. and in the corresponding educational units, namely that the 
range and S.D. of scores expressed in these units are wider for some 
tests than for others. For example, the application of educational 
tests to a class of 9-year-olds might show that their average E.A. was 
9, and that 20% of them had E.A.s of 10 or more. But the application 
of a group intelligence test to the same pupils might show an average 
M.A. of 9 and as many as 30% with M.A.s of 10 or more. If this was 
so, then clearly an advancement of 1 year, or an I.Q. or E.Q. of 111 
(i.e. 10/9 × 100), does not mean the same thing on different tests. Some 
intelligence tests yield I.Q.s with a S.D. of 15 or even less, others 
yield I.Q.s with a S.D. as high as 25 or more. Probably educational 
tests show similar irregularities. The S.D. of the old Stanford-Binet 
test I.Q.s was established as about 15, and many modern tests such as 
the Moray House group tests and the Wechsler Intelligence Scales 
for Children and Adults have adopted the same figure as a convenient 
one. Terman-Merrill I.Q.s tend to be more widely dispersed, 
their average S.D. being quoted as 16½. But it varies from 12½ to 20 
at different age levels. This means that I.Q.s (with the same mean of 
100) have different meanings from one test to another, or on the same 
test at different ages. An ordinary school class will commonly obtain 
a range of I.Q.s from about +2σ to -2σ, i.e. from 130 to 70 on a 
test with a S.D. of 15. But the same class tested with a group test 
whose S.D. is 25 will range in I.Q. from 150 to 50. To take another 
example: a class of very bright children may have an average I.Q. of 
115 on a Moray House test. The same class tested with another group 
test may have an average I.Q. of 125. The following table illustrates 
the proportions to be expected in the total population with I.Q.s 125 
or more, or 75 or less, on tests with various S.D.s. 

S.D.      % I.Q. > 125 or < 75 

12½       2½ 
15        5 
16½ 
20        11 
25        16 
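
Proportions of this kind follow directly from the normal curve. The sketch below assumes normally distributed I.Q.s with a mean of 100 and gives the percentage in one tail only (at or above 125, which by symmetry equals the percentage at or below 75); the function name is ours.

```python
from statistics import NormalDist

def percent_at_or_beyond(iq_cutoff: float, sd: float, mean: float = 100.0) -> float:
    """Percentage of a normal population at least as far from the mean as the cutoff, on one side."""
    z = abs(iq_cutoff - mean) / sd
    return 100 * (1 - NormalDist().cdf(z))

for sd in (12.5, 15, 16.5, 20, 25):
    print(sd, round(percent_at_or_beyond(125, sd), 1))
```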


Unfortunately it is impossible to assert that any particular value 
of the S.D. of I.Q.s is the ‘right’ one, and that the other values are 
too large or too small. The value depends on certain features of the 
test which are still somewhat obscure, and are too complex for 
discussion here.¹ The figure 15 is implied when we talk of 70 as the 
border-line for mental deficiency, since most of the testing of 
border-line cases has been carried out with the Stanford-Binet test whose 
S.D. kept fairly close to 15. The only satisfactory solution to this 
problem is to give up the Mental and Educational Age system of 
calculating I.Q.s and E.Q.s, and to establish separate percentile or 
standard-score norms at each age level for which the test is suitable. 
The conventional form of I.Q. has been, and is still being, used far 
too loosely, as though it represented a standard scale of 
measurement, independent of the particular test employed. Actually, like 
any other test score, its interpretation is possible only if its S.D. (as 
well as the mean of 100) is taken into account. The derivation of 
standard scores may be somewhat more difficult for untrained 
testers to grasp. But such scores do have a uniform significance with 
different tests at different ages, and — when the S.D. of 15 is adopted 
— they do provide a scale which is very closely comparable to the 
older, generally accepted, scale of I.Q.s. 

¹ The main factor seems to be the degree of heterogeneity among the items or 
subtests. The more homogeneous the test material, the higher will be the S.D. 
of I.Q.s. Hence tests composed mainly of verbal reasoning items like Cattell’s 
Scales II and III tend to have higher S.D.s than more miscellaneous tests like 
Stanford-Binet. But in addition certain types of items seem to be intrinsically 
more ‘spreading’ than others. 



CHAPTER V 


RELIABILITY AND THE THEORY OF SAMPLING 

Untrustworthiness of Results obtained from Small Numbers of Persons. —
This chapter is likely, we must admit, to cause considerable difficulty to
readers unfamiliar with statistical concepts. It introduces a point of view
with regard to educational measurements and experiments which may
at first seem very complex and involved, yet which is absolutely
essential to a proper understanding of scientific education and psychology.¹

The scientific investigator naturally wishes to arrive at results 
which are true for people in general. He wants to prove that a certain 
method of teaching gives superior results with all children, that the 
correlation² between intelligence tests or school examinations and
subsequent scholastic achievement always reaches a certain figure,
or he wants to know the average test score of all persons brought up
in a particular environment, e.g. of rural as contrasted with urban
children, or Scottish as contrasted with English, and so on. But he 
can seldom, if ever, measure all the persons to whom his conclusions 
are supposed to apply. He is forced to work with samples of the 
population, and he is therefore entitled to regard his result — whether 
it be an average, or a difference between two averages, or a correlation 
coefficient, etc. — only as a sample result, which might alter to some 
extent if he repeated the work on other samples of the population. 

He must, of course, take precautions in the first place to see that 
his sample is not what we call a biased one. Supposing his problem 
is the securing of norms for a group intelligence test, and he wishes 
to find the average and the dispersion of scores among children aged 
11 +, he must, as we saw in the previous chapter, avoid choosing 
only schools in poor areas or only private schools. But even when 
such bias in selection is eliminated, he would still have to content 
himself with a limited sample of the children aged 11 + in the 
country as a whole, and the average of this sample might differ to 
some extent from the true average. Similarly if his problem was the
comparison of the intelligence of children in suburban and industrial
areas, he could not measure all the schools of these types. Thus the
difference which he obtained would only be an approximation to the
difference that might result from much more complete investigations.
Here are two concrete instances.

¹ A particularly clear explanation has been given by Thouless (1939b).

² The reader who is unfamiliar with the conception of correlation may be
advised to study pp. 92-5 of the next chapter before continuing with this
chapter.

Ex. 27. — Twelve of the writer’s training college students were
given an American objective test of reading ability, namely the 
Nelson-Denny test, which measures knowledge of word meanings 
and comprehension of paragraphs. Their scores, arranged in order, 
were: 140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82. 

Now, most of these scores are well above the average for American
college students, namely 96. Their average, 113.6, falls at the 78th
American percentile. But obviously the scores are very variable.
The question is, then, are we or are we not justified in deducing from
them that Scottish training college students possess higher reading
ability than American college students? Obviously no such deduction
could be made from the score of only a single student. But does
the average of a sample of 12 provide a much surer basis? The 
12 were picked at random from among the writer’s 264 students. 
Thus he was careful to avoid selecting only those, or a prepon- 
derance of those, who had an Honours degree in English. And men 
and women were equally represented, since it was conceivable that a 
sample composed only of women or only of men would be biased. 
Yet it is quite possible that he unintentionally picked an undue 
number of clever students, and that if he picked another dozen, 
their scores might be mostly below 100, and their average no higher, 
or even lower, than the American figure. How liable then is such 
an average of twelve measures to vary? 

The question can be answered in this instance, since the same test
was given to 21 other sets of 12 students, also chosen at random.
The 22 averages range from 103.6 to 126.8, their distribution being
shown in the following table. As we suspected, the averages are
rather variable, though less so than the separate scores, which range
all the way from 62 to 160. And though our first average of 113.6
happens to be fairly close to the grand average of 114.9, yet we might,
in the luck of the draw, have obtained an average as much as 10
points above or below this figure.

Average Mark    Frequency
124 +               1
121 +               3
118 +               2
115 +               3
112 +               8
109 +               2
106 +               2
103 +               1
                   --
                   22
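The behaviour of such averages can be imitated by drawing random samples from an imaginary population. The sketch below assumes, purely for illustration, a normal population with the mean (114.9) and S.D. (20.15) quoted for the 264 students; the seed, the number of samples and all the names used are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is repeatable

POP_MEAN, POP_SD = 114.9, 20.15   # assumed normal population (illustrative)
SAMPLE_SIZE = 12

def sample_mean():
    """Average of one random sample of twelve scores."""
    return statistics.fmean(
        random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)
    )

means = [sample_mean() for _ in range(2000)]
spread_of_means = statistics.pstdev(means)

print(f"S.D. of single scores:     {POP_SD:.2f}")
print(f"S.D. of averages of 12:    {spread_of_means:.2f}")
```

The averages should prove considerably less variable than the single scores, though still far from steady, which is the point the table makes on a small scale.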
Ex. 28. — The correlation coefficient was calculated between the
marks of 25 students on an examination given near the beginning of
the year, and their marks on another paper at the end of the year.
The coefficient was + 0.53. But 25 students is only a very small
sample, and similar results calculated for 13 other similar samples
varied widely, as shown in the following table.


Correlation Coefficients      F
+ 0.60 to + 0.69              1
+ 0.50 to + 0.59              1
+ 0.40 to + 0.49              2
+ 0.30 to + 0.39              5
+ 0.20 to + 0.29              2
+ 0.10 to + 0.19              1
  0.00 to + 0.09              1
− 0.10 to − 0.01              1


Thus the original figure is far from trustworthy, and even the
average coefficient of + 0.315, for 350 students, though probably
much more reliable, might also alter to some extent if the same
examinations were compared among other similar large groups of
students.

The Scientific Standpoint. — The scientific investigator, therefore,
always looks on any numerical result as a sample one which may 
deviate more or less from the truth. Even when he has excluded errors 
of selection such as would make the persons he is measuring in any 
way unrepresentative, he has to consider the probable extent of these 
so-called chance errors of sampling. He therefore goes on to assume 
that it is theoretically, though perhaps not practically, possible to 
obtain a very large number of other such samples. The two tables 
given above are instances on a small scale. They both bring out a 
vital step in our argument, namely that such additional samples tend 
to conform to normal distributions. The numbers involved are small, 
hence these distributions are irregular, but there exists ample evi- 
dence to show that larger numbers would eventually yield smooth 
normal curves. 

Let us take a fresh instance, where only a single sample result is 
available, in order to illustrate the further stages of the argument. 

Ex. 29. — The average scores for 136 men and 114 women students
on the Cattell Intelligence Test III (cf. Ex. 3, p. 10) were 101.55 and
99.65 respectively. Thus the men were 1.90 points superior.

Can we deduce that men students, of the type tested, are in general
better than women, or would we, if we applied the same tests to
many other similar groups, find the difference to be larger in some
cases, negligibly small in other cases, or even reversed in others?
This difference, which we shall call d, is only a sample one, which
should be regarded as falling at some point on the distribution of a
large number of possible alternative d's. Clearly the best possible
estimate of the true difference between men and women would be a
grand average of all these hypothetical d's. This quantity — the mean
of the distribution of d's — we shall call D. The particular d which we
know is a more or less accurate approximation to D.

A formula has been worked out by means of which it is possible
to predict, from the actual data available, what would be $\sigma_D$, that is
the S.D. of this distribution of d's. The formula will be given later.
In the present instance it yields $\sigma_D = 1.41$. As the distribution is a
normal one, this figure will enable us to define its limits. We can state
that about one-third of the d's must fall between D and D + 1.41,
another third between D and D − 1.41, and that about one-third of
them must deviate more widely from D. Scarcely any of them, or
only 0.135%, will be as large as D + 4.23 (i.e. $3\sigma_D$), and only 0.135%
will be as small as D − 4.23. Hence 99.73% (i.e. 100 − 2 × 0.135)
of the d's lie within the limits D ± 4.23, and the odds against any
d lying outside these limits are 99.73 to 0.27, or 370 to 1. As a general
rule, the probability is 370 to 1 that any d which deviates from D
purely on account of chance errors of sampling will not do so by
more than $3\sigma_D$.

Certain other probabilities have been deduced from the probability 
integral graph (Fig. 18), and are listed in the following table: 



Limits                   Frequencies of d's        Odds against d's     p value
                         falling outside           falling outside
                         these limits
± 0.6745 $\sigma_D$      2 × 25.0%                 1 to 1               .50
± 1.00 $\sigma_D$        2 × 15.9%                 nearly 2 to 1        .30
± 1.645 $\sigma_D$       2 × 5.0%                  10 to 1              .10
± 1.96 $\sigma_D$        2 × 2.5%                  20 to 1              .05
± 2.58 $\sigma_D$        2 × 0.5%                  100 to 1             .01
± 3.29 $\sigma_D$        2 × 0.05%                 1,000 to 1           .001
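Where the probability integral graph is not to hand, the same quantities can be computed directly. This is a minimal sketch using the complementary error function of the normal curve; the function names are illustrative assumptions, and the odds are reckoned as (1 − p) to p.

```python
import math

def two_tailed_p(z):
    """Proportion of a normal distribution lying beyond ±z standard errors."""
    return math.erfc(z / math.sqrt(2.0))

def odds_against(z):
    """Odds against a deviation larger than ±z, reckoned as (1 - p) to p."""
    p = two_tailed_p(z)
    return (1.0 - p) / p

for z in (0.6745, 1.00, 1.645, 1.96, 2.58, 3.00, 3.29):
    print(f"±{z:<6}  p = {two_tailed_p(z):.4f}   "
          f"odds about {odds_against(z):.0f} to 1")
```

Reckoned strictly in this way the odds at 1.645 and 1.96 come out nearer 9 to 1 and 19 to 1; the round figures of 10 and 20 correspond to chances of 1 in 10 and 1 in 20.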


Now these predictions must apply to the particular d which we
happen to know, whose value is 1.90. The odds are very much against
its deviating as much as $3.29\sigma_D$ from D, but it might quite possibly
deviate as much as, say, $1.96\sigma_D$, since the odds against this are much
lower. We can therefore, as it were, reverse the argument, and state
with practical certainty that D must lie between $1.90 \pm 3.29\sigma_D$, i.e.
between + 6.54 and − 2.74. We do not know, of course, what is the
value of D, but if it fell outside these limits, our d would be more
than $3.29\sigma_D$ removed from it, and the odds against this are 1,000 to 1.
Similarly the odds are 100 to 1 that it lies between $1.90 \pm 2.58\sigma_D$
= + 5.54 and − 1.74; and 20 to 1 that it lies between $1.90 \pm 1.96\sigma_D$
= + 4.66 and − 0.86. These latter limits are commonly referred to
as the .01 and .05 confidence limits, respectively, of our obtained d.
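The limits just quoted follow from simple arithmetic, which may be sketched as below. The function name `confidence_limits` is an illustrative assumption; the figures 1.90 and 1.41 are those of Ex. 29.

```python
def confidence_limits(d, se, multiple):
    """Limits d ± multiple × S.E. within which the true difference is presumed to lie."""
    margin = multiple * se
    return d - margin, d + margin

d, se = 1.90, 1.41
for label, multiple in (("20 to 1", 1.96), ("100 to 1", 2.58), ("1,000 to 1", 3.29)):
    low, high = confidence_limits(d, se, multiple)
    print(f"odds {label:>10}: D between {low:+.2f} and {high:+.2f}")
```

Each pair of limits includes zero, which already suggests that the obtained difference cannot be trusted.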

The Standard Error as an Index of the Trustworthiness of a Result. —
It may be seen then that $\sigma_D$ provides us with an index of the margin
of uncertainty in our d. It tells us that the difference of 1.90 in favour
of men students is by no means trustworthy, since the true value of
D may be considerably larger or smaller.

This is the manner in which the scientific investigator always thinks
of the results of psychological or educational experiments. His
obtained result, being only a sample based on a limited number of cases,
is likely to deviate to a certain extent from the true result which he
would obtain if he could experiment on a very much larger set of
people. He can, however, work out the $\sigma$, often referred to as the
Standard Error or S.E. of his result, which tells him, not precisely
how much his result is in error, but what is the probability that his
result is in error by various amounts. There is a 2 to 1 chance that the
error may be no bigger than the S.E., and the odds against the
occurrence of a larger error rise very rapidly, so that it is extremely unlikely
to be greater than three times the S.E.

The Probable Error. — The figures at the head of the table, above,
show that there is an even chance of D lying within, or lying outside,
the limits $\pm 0.6745\sigma_D$. This particular fraction of the S.E. is known
as the Probable Error or P.E. The name is not a very good one, but
in a sense it is the most probable error, since the odds against either
larger or smaller errors than this are higher than 1 : 1. In our example,
the P.E. = 0.6745 × 1.41 = 0.95. This means that D is just as likely
to lie between 1.90 + 0.95 and 1.90 − 0.95 as it is to lie outside these
limits.

The P.E. is frequently used as an index of the unreliability or
margin of uncertainty, in preference to the S.E. Instead of saying
that the odds against D deviating from d by 3.29 × S.E. are 1,000
to 1, we can say that a deviation of 5 × P.E. possesses this degree
of unlikelihood, since 5 × 0.6745 is approximately equal to 3.29.
Or if ± 1.96 × S.E. are regarded as the limits within which D may
reasonably be expected to lie, we can alternatively denote these limits
as ± 3 × P.E. Often a numerical result such as our d of 1.90 is
stated in the form: 1.90 ± 0.95, where the latter figure is the P.E. of
the former. While both the S.E. and the P.E. indicate the likely
amount of error in some numerical result, the S.E. is, as we have
seen, primarily a measure of the dispersion of all the possible sample
results. The same is true of the P.E. In the hypothetical distribution
of sample d's, the limits D + P.E. to D − P.E. enclose 50% of these
d's, and 25% of them are greater than D + P.E., 25% less than D −
P.E. Note therefore that the P.E. is identical with Q, the
semi-interquartile range, of this distribution. For the limits $Q_1$ to $Q_3$ also mark
off the middle 50% of measures in a normal distribution.

The Statistical Significance of a Difference. — Many experiments
have shown that the scores of comparable groups of men and women
on intelligence tests are practically identical, or that any differences
found are not reliable. Since, in our example, the odds are only 21
to 1 against D being greater than 4.72 or less than − 0.92, it seems
quite possible that here too the d is not reliable, and that the true
value of D is zero. If this be so, then it is easy to determine from the
probability integral graph the chances that a sample might show a d
of + 1.90, merely through errors of sampling. 1.90 is 1.35 × 1.41,
or $1.35\sigma_D$. Consulting the graph (Fig. 18), we see that 9% of measures
in a normal distribution lie at or beyond $1.35\sigma$ from the mean. Hence
when D is zero, 9% of the sample d's amount to + 1.90 or more,
and 41% lie between 0 and + 1.90. Thus the odds against a d of
1.90 being due to errors of sampling are only 41 to 9, or about 4½ to
1.¹ In other words, it might occur as a matter of chance once in four
or five testings of similar samples of students. Such odds are far too
low for us to put any trust in the conclusion that men students are
better than women at this test. Had d been 3.63 (i.e. $2.58\sigma_D$) we might
have felt reasonably certain, since this difference could arise by
chance only once in a hundred times. It would then be referred to as
statistically significant at the .01 level. Note that ‘significant’ does
not necessarily imply ‘meaningful.’ We are not concerned here with
the psychological interpretation of any difference. Statistically
significant refers purely to the reliability or trustworthiness of the figures.

¹ The acute reader may well ask here: why not contrast the 9% of sample d's
of + 1.90 and over with the 91% of smaller and of negative d's, thus arriving
at odds of about 10 to 1? This procedure is denoted as the one-tailed test, since
it is based on the top end or tail of the normal distribution of d's; whereas the
test described in the text is the two-tailed. Many statisticians hold that the
one-tailed test should be used when it is desired to show that a difference is as large
or larger than a certain amount, whereas the two-tailed test is appropriate when
attempting to prove that the difference lies outside certain limits in either
direction, positive or negative. Others, however, prefer to use the two-tailed
test under almost all circumstances, since it provides a more conservative
estimate of the trustworthiness of their results. By following this practice we are,
in effect, implying that the true sex difference might be in favour of men or
women, rather than that the true difference might be in favour of men or else
be zero.
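The reckoning of these odds for the difference of 1.90 may be sketched as follows, under the supposition that the true difference D is zero. The variable names are illustrative assumptions; the tail area is computed from the normal curve rather than read from Fig. 18.

```python
import math

def upper_tail(z):
    """Proportion of a normal distribution lying at or beyond +z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

d, se = 1.90, 1.41
critical_ratio = d / se                 # about 1.35
tail = upper_tail(critical_ratio)       # share of d's at +1.90 or more
between = 0.5 - tail                    # share lying between 0 and +1.90

odds_two_tailed = between / tail        # the reckoning used in the text
odds_one_tailed = (1.0 - tail) / tail   # the reckoning described in the footnote

print(f"critical ratio {critical_ratio:.2f}, tail {100 * tail:.1f}%")
print(f"odds about {odds_two_tailed:.1f} to 1 (two-tailed), "
      f"{odds_one_tailed:.1f} to 1 (one-tailed)")
```

Either way the odds fall far short of the 20 to 1 or 100 to 1 conventionally demanded.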

A difference between the results of two groups, divided by its S.E.,
is often called the Critical Ratio or C.R. In this example the C.R. is
only 1.35. It is customary to place little reliance in a difference whose
C.R. is less than 2 or preferably 3. Alternatively a d is regarded as
significantly greater than zero if it exceeds 3, or preferably 5, times
its P.E. But the border-lines are quite arbitrary matters of convention;
we are certainly not justified in rejecting all smaller differences
because they are conventionally non-significant. Suppose the C.R.
was 1.645 (d = 1.19), with a probability of 1 in 10. Though this
might easily arise by chance, it does not prove that the true value of
D is zero. Indeed, according to the .01 confidence limits, the true
value might lie anywhere between + 4.82 and − 2.44. Such a critical
ratio is said to be of borderline significance, since it suggests that a
true positive difference might be slightly more likely to occur than a
zero or a negative one if we were able to investigate further samples.

Another note of caution is necessary. Even if d had been large 
enough in the above example (relative to its S.E. or P.E.) to be 
accepted as significant, the conclusion that men are somewhat 
superior at the test would have applied only to the type of students 
tested, namely Scottish university graduates who had chosen the 
teaching career. It could not be extended to English training-college 
students, nor to other university students, far less to men and women 
in general, since the persons actually tested could not be regarded as 
typical or representative of these wider groups. Again it would be
safer not to claim that such a difference was a difference in intelli- 
gence, but only in the ability which is measured by this particular 
group intelligence test. The statistical treatment which has been 
applied to the distributions of scores only informs us how sound and 
reliable are the results in respect of other similar testees, i.e. it takes 
care of sampling errors, but it does not eliminate errors or bias in 
selection. As mentioned in Chap. I, statistical methods cannot im- 
prove data which are in any way inaccurate or defective ; they merely 
bring out the significance (or lack of significance) of the data actually 
obtained.

FORMULAE FOR THE STANDARD AND PROBABLE ERRORS OF
NUMERICAL RESULTS

Common sense tells us that a numerical result such as an average,
or a correlation, or a difference between two averages, will always be
more reliable the greater the number of cases on which it is based.
In Ex. 27 (p. 75) it was shown that the averages obtained by sets of
twelve students on a test of reading ability were exceedingly variable.
Obviously twelve is far too small a sample upon which to base
conclusions. In order to make a trustworthy comparison between the
performance of Scottish and American students at this test, we
should need much larger numbers. Most of the formulae for
reliability involve the quantity $\sqrt{N}$, which means that by quadrupling
the size of a sample, we can halve the variability of any result
computed from this sample, or double its reliability. Another important
factor which governs the reliability of a result is the variability among
the original measures. Suppose that, in Ex. 29, all the men students
had scored 101 or 102 (average 101.55), and all the women had
scored 99 or 100 (average 99.65), we should certainly have been
entitled to claim that men students do better at the test than women.
But it is because the men’s scores range all the way from 73 to 134,
and the women’s from 78 to 129, that we felt suspicious of the difference
in averages, and applied statistical treatment to it. There was so much
overlapping that about 42% of the women obtained higher scores
than the average man, and about 42% of the men scored lower than
the average woman. Hence the reliability formulae also involve some
measure of the dispersion or extent of variation of the original
measures.

The S.E. of a Difference. — This S.E. is given by the formula:

$$\sigma_D = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the scores in the two groups, $N_1$ and
$N_2$ are their numbers. The P.E. of a difference is 0.6745 times this
quantity.


Ex. 30. — As a fresh illustration, let us compare the scores on the
Nelson-Denny reading test of students having Honours degrees in
Arts with the scores of students having Honours in Science or
ordinary degrees. The essential data are as follows:

                          N       M         $\sigma$
Honours Arts students     66      129.47    16.42
Other students            198     109.97    18.90

The difference between the two averages is 19.50.

$$\sigma_D = \sqrt{\frac{16.42^2}{66} + \frac{18.90^2}{198}} = \sqrt{4.088 + 1.804} = 2.427$$

Here the critical ratio $\frac{19.5}{2.427}$ is approximately 8. Such a difference
could not possibly occur through chance errors of sampling. Hence
we may conclude that Honours Arts students, of the type tested, are
definitely superior to other graduate students at this test.
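The formula for the S.E. of a difference is easily put into a short routine. The sketch below applies it to the figures of Ex. 30, and also to the men and women of Ex. 29, whose S.D.s (12.36 and 9.95) are those quoted in Ex. 33; the function name `se_difference` is an illustrative assumption.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the means of two independent groups."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Ex. 30: Honours Arts students against other students.
se_30 = se_difference(16.42, 66, 18.90, 198)
cr_30 = (129.47 - 109.97) / se_30

# Ex. 29: men against women students.
se_29 = se_difference(12.36, 136, 9.95, 114)
cr_29 = (101.55 - 99.65) / se_29

print(f"Ex. 30: S.E. = {se_30:.3f}, critical ratio = {cr_30:.2f}")
print(f"Ex. 29: S.E. = {se_29:.3f}, critical ratio = {cr_29:.2f}")
```

The contrast between a critical ratio of about 8 and one of about 1.35 shows how differently the two results deserve to be treated.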

S.E. of a Difference between Averages of Scores which are
Intercorrelated. — When one group of persons takes two tests, or if they
repeat a test, and we wish to examine the statistical significance of
the difference between their two average scores, we must use a
modified formula. This applies whenever there exists a correlation,
r, between the two sets of scores:¹

$$\sigma_D = \sqrt{\frac{1}{N}\left(\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2\right)}$$
Ex. 31. — Marks in English and arithmetic were awarded to a class
of 57 children. These are tabulated in Ex. 23, p. 57. Apparently the
arithmetic marks run higher than the English ones. Is the difference
significant? The figures are as follows:

               Mean      Difference    $\sigma$    N
English        67.615                  12.94
                         2.895                     57
Arithmetic     70.510                  14.64

The correlation between the two sets of marks is indicated by a
coefficient of + 0.542.

$$\sigma_D = \sqrt{\frac{1}{57}\left(12.94^2 + 14.64^2 - 2 \times 0.542 \times 12.94 \times 14.64\right)} = 1.761$$

The difference of 2.895 is only 1.64 times its S.E., hence it is not
significant.
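The working of Ex. 31 may be checked by a short routine. This is a sketch only; the function name `se_correlated_difference` is an illustrative assumption, and the S.E. computed from the rounded figures agrees with the 1.761 of the text to within rounding.

```python
import math

def se_correlated_difference(sd1, sd2, r, n):
    """S.E. of the difference between two means obtained from the same N persons."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2) / n)

se_31 = se_correlated_difference(12.94, 14.64, 0.542, 57)
cr_31 = (70.510 - 67.615) / se_31

# For comparison: the S.E. that would result if the correlation were ignored.
se_uncorrelated = se_correlated_difference(12.94, 14.64, 0.0, 57)

print(f"S.E. allowing for r:  {se_31:.3f}  (critical ratio {cr_31:.2f})")
print(f"S.E. ignoring r:      {se_uncorrelated:.3f}")
```

A positive correlation reduces the S.E., so that ignoring it would make the difference appear even less reliable than it is.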

¹ An alternative version of the formula, which obviates calculating the
correlation coefficient, is given below in the section on small samples (p. 85).

S.E. of a Mean. — An average is, as we have seen in Ex. 27, quite
as likely to vary through errors of sampling as a difference between
averages. Hence it is very useful to know how near to the true
average of a very large group is likely to be the average obtained
from a group of limited size. The S.E. of the mean is often
symbolized by $\sigma_M$, and is given by the formula: $\sigma_M = \frac{\sigma}{\sqrt{N}}$. Here $\sigma$ is, of
course, the S.D. of the original scores, whose average has been
computed.


Ex. 32. — The twelve scores on the Nelson-Denny test, listed in
Ex. 27, have a mean of 113.58 and S.D. of 20.07.

$$\sigma_M = \frac{20.07}{\sqrt{12}} = 5.794$$

The P.E. of M is 0.6745 × S.E. = 3.908.

Thus the true mean for students of the type tested may just as well
lie outside the limits 113.58 ± 3.91 (i.e. 117.49 to 109.67) as within
them. It may be as much as $2.58\sigma_M$ away from 113.58, i.e. as high as
128.53 or as low as 98.63, but it is unlikely to exceed these 0.01
confidence limits. Obviously the figure is extremely unreliable.
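The figures of Ex. 32 can be recomputed from the twelve scores listed in Ex. 27. In this minimal sketch the S.D. is taken with N in the denominator, which reproduces the value of 20.07; the variable names are illustrative assumptions.

```python
import math
import statistics

scores = [140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82]

mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)             # S.D. with N in the denominator
se_mean = sd / math.sqrt(len(scores))      # S.E. of the mean
pe_mean = 0.6745 * se_mean                 # P.E. of the mean

print(f"mean = {mean:.2f}, S.D. = {sd:.2f}")
print(f"S.E. of mean = {se_mean:.3f}, P.E. of mean = {pe_mean:.3f}")
print(f"even-chance limits: {mean - pe_mean:.2f} to {mean + pe_mean:.2f}")
```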

By contrast the mean for 264 students was 114.9 and the S.D.
20.15. Here $\sigma_M = \frac{20.15}{\sqrt{264}} = 1.24$. The sample is 22 times as large,
hence $\sigma_M$ is $\frac{1}{\sqrt{22}}$ of its previous value. The P.E. of M = 0.836. From these
figures it follows that there is an even chance of the true mean lying
between 115.7 and 114.1, and that it almost certainly does not lie
outside the limits 118.1 and 111.7.


S.E. of a Standard Deviation and of a Difference between S.D.s. —
Not only the mean but also the S.D. of a set of measures may be
affected by errors of sampling. The formula for the S.E. of a S.D. is:

$$\frac{\sigma}{\sqrt{2N}}$$

The S.E. of a difference between two S.D.s is given by:

$$\sqrt{\frac{\sigma_1^2}{2N_1} + \frac{\sigma_2^2}{2N_2}}$$

Ex. 33. — The S.D.s of the intelligence test scores of men and
women students were 12.36 and 9.95 respectively, indicating that
the dispersion of ability was greater among the men. The S.E. of the
difference of 2.41 between these two figures is:

$$\sqrt{\frac{12.36^2}{2 \times 136} + \frac{9.95^2}{2 \times 114}} = 0.998$$

The difference is 2.42 times its S.E., and this shows a fair degree of
statistical significance. The odds against such a difference occurring
through chance errors are 49.23 to 0.77, or 64 to 1. The result
therefore allows a reasonably strong presumption that men students
of the type tested are in general more spread out in their ability at
an intelligence test than are women students.
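The two formulae for S.D.s may be sketched in the same manner, using the figures of Ex. 33. The function names are illustrative assumptions.

```python
import math

def se_sd(sd, n):
    """S.E. of a standard deviation based on n cases."""
    return sd / math.sqrt(2 * n)

def se_sd_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the S.D.s of two independent groups."""
    return math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))

se_33 = se_sd_difference(12.36, 136, 9.95, 114)
cr_33 = (12.36 - 9.95) / se_33

print(f"S.E. of men's S.D.:    {se_sd(12.36, 136):.3f}")
print(f"S.E. of women's S.D.:  {se_sd(9.95, 114):.3f}")
print(f"S.E. of difference:    {se_33:.3f}  (critical ratio {cr_33:.2f})")
```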


S.E. of a Percentage and of a Difference between Percentages. —
Often it is not possible to measure the amounts of some quality in a
group of persons, but only to state what percentage of the persons
possess or do not possess it. Suppose that in several schools 60% of
the boys and 65% of the girls regularly take milk, does this indicate a
reliable sex difference? Again the trustworthiness of the result depends
on the number of cases. If p is the percentage, and q = 100 − p,
then the S.E. of p is $\sqrt{\frac{pq}{N}}$, and the S.E. of a difference between two
percentages is:

$$\sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}}$$

Ex. 34. — From how many children would it be necessary to obtain
milk-drinking records in order to be practically certain that the
difference of 65% − 60% is reliable?

This example is posed in the reverse form from Exs. 30-33, but the
same method is employed. If a difference of 5% is reliable, it must
be about 2½ times its S.E. That is, the S.E. must not be greater than
2. Let the number of both boys and girls in the schools where records
are obtained be N. Then:

$$\sqrt{\frac{60 \times 40}{N} + \frac{65 \times 35}{N}} = 2$$

$$N = 1{,}169$$

The answer is, then, that the sex difference of 5% would be
significant if based on some 1,200-odd boys and an equal number of girls,
but that it would not be so if the numbers were much smaller.
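The reverse calculation of Ex. 34 may be sketched as follows, assuming equal numbers of boys and girls as in the text. The function names are illustrative assumptions.

```python
import math

def se_percentage_difference(p1, n1, p2, n2):
    """S.E. of the difference between two percentages from independent groups."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

def n_per_group(p1, p2, max_se):
    """Smallest equal group size for which the S.E. of p1 - p2 does not exceed max_se."""
    return math.ceil((p1 * (100 - p1) + p2 * (100 - p2)) / max_se ** 2)

n_needed = n_per_group(60, 65, 2.0)
print(f"children needed in each group: {n_needed}")
print(f"S.E. at that size: {se_percentage_difference(60, n_needed, 65, n_needed):.3f}")
```

The same routine shows how quickly the required numbers grow: halving the permitted S.E. quadruples the number of children needed.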


One of the commonest faults in writings about educational topics 
is the quotation of percentage results without any indication of the 
reliability of these results, often without even any mention of the 
actual numbers of cases from which the P.E.s could be calculated. 

It should be noted that all the formulae quoted above for the S.E.
of a difference (between means, between S.D.s, and between
percentages) are derived from one and the same formula:

$$\sigma_D = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$$

where $\sigma_1$ and $\sigma_2$ are the S.E.s of the means, S.D.s, and percentages
respectively (not the S.D.s of the original scores). The third term
applies whenever there is any correlation between the original
measures from which the means, S.D.s, or percentages are derived.
But when, as in most of our examples above, there is no such
correlation, the term vanishes.

Statistical Treatment of Small Samples. — It has been shown by
Fisher (1930) and others that numerical results obtained from very
small samples are even less reliable than the statistical treatment
described above would lead us to expect. Under such circumstances
it is better to calculate a S.E. by substituting N − 1 for N in the
denominator.

The resulting critical ratios are denoted by the symbol t, and
special tables are available which show the probability of various
values of t for different numbers of cases.¹ For a difference between
the means of two groups, the proper formula is:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$

$\Sigma x_1^2$ and $\Sigma x_2^2$ are the sums of squared deviations from the
respective means. If calculated from raw scores, or from arbitrary means,
the corrections described on pp. 46-7 must be inserted.
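The formula for t between two small groups may be sketched as below. The two sets of scores are hypothetical, invented solely to illustrate the arithmetic, and the function name `t_independent` is likewise an assumption.

```python
import math

def t_independent(group1, group2):
    """t for the difference between the means of two independent small groups.

    Returns (t, degrees of freedom).
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Sums of squared deviations from the respective means.
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2

boys = [12, 15, 11, 14, 13]    # hypothetical scores
girls = [10, 9, 12, 11, 8]     # hypothetical scores

t, df = t_independent(boys, girls)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The value of t obtained would then be referred to the published tables at the appropriate number of degrees of freedom.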

For a difference between two sets of scores in the same group, it
is simplest to tabulate the actual differences, d, between each person’s
two scores (allowing for signs). Then:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma d^2 - \dfrac{(\Sigma d)^2}{N}}{N(N - 1)}}}$$

This automatically allows for the correlation between the two
variables.
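The method of tabulated differences may be sketched in the same way. The marks below are hypothetical, chosen only to show the steps; the function name `t_paired` is an illustrative assumption.

```python
import math

def t_paired(first, second):
    """t for the difference between two sets of scores from the same persons.

    Returns (t, degrees of freedom).
    """
    diffs = [a - b for a, b in zip(first, second)]   # signed differences
    n = len(diffs)
    sum_d = sum(diffs)
    sum_d2 = sum(d * d for d in diffs)
    se = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n * (n - 1)))
    return (sum_d / n) / se, n - 1

arithmetic = [70, 65, 80, 75, 60]   # hypothetical marks
english = [68, 66, 75, 71, 60]      # hypothetical marks

t_pair, df_pair = t_paired(arithmetic, english)
print(f"t = {t_pair:.3f} with {df_pair} degrees of freedom")
```

Since only the differences enter the calculation, no correlation coefficient need be worked out.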

Note that this modified treatment, though statistically preferable,
makes very little difference until the N’s fall below about 15. The
enormous majority of educational experiments are carried out on
much larger numbers.

Analysis of Variance. — An extension of the t-technique enables us
to explore the reliability of differences between several groups
simultaneously, instead of dealing with only two at a time. For example,
we can test the significance of differences in means or percentages
between sets of pupils taught by several alternative methods, or
classes in several schools, or children classified according to several
grades of parental occupation. Again the groups may be classified
in two or more directions simultaneously, say by teaching method,
type of school, sex and intelligence level. The relative importance of
these influences on average test scores, and their statistical
significance, can thus be assessed. Note that this involves an important
departure from the model of the classical scientific experiment,
where the effects of altering one variable at a time were observed, all
other variables being kept constant. Hence analysis of variance,
which was first developed by Fisher for studying the effects of
different fertilizers or other conditions in agriculture, has become
one of the most powerful and most widely used tools of educational
experimentation. It has also proved of particular value in analysing
examination marks: how far these vary with the examiner, or with
the question or topic set, or with the school from which the
candidates come, and so on.

¹ Cf. Fisher and Yates (1938). Instead of the actual numbers of cases, the
tables refer to what are called degrees of freedom, which are normally N − 1.
Thus for a difference between 13 boys and 10 girls, the degrees of freedom are
12 + 9 = 21.

The detailed techniques are somewhat elaborate for description 
here (cf. Lindquist, 1940), but the following simple example will 
give some indication of their nature. 

Ex. 35. — The writer’s men students were divided, according to
their various courses of teacher training, into seven sections, each
numbering about 22 students. The average level of ability in different
sections seemed to differ considerably, and their average percentage
marks on examinations in psychology are listed in the second
column of the table below. The third column shows, however, that
the marks in any one section were very variable, usually ranging
from 45% to 85%:

Section    Mean Psychology Mark    S.D. of Marks
A          72.89                   8.25
B          68.69                   8.43
C          65.59                   7.88
D          64.41                   7.05
E          63.95                   8.58
F          63.52                   10.71
G          63.48                   10.71

Thus in order to answer the
question whether there exists any relation between a student’s ability
and the section to which he is assigned, we must determine whether
these differences between the means of sections are significant or
whether, in view of the variability of the marks within each section,
such differences might arise through chance fluctuations of sampling.
“Variance” is simply another term for spread, dispersion, or
variability (actually the variance of a distribution of measures is the
square of its S.D.). Analysis of variance enables us to say how far
these differences between sections can be ascribed to the effect of
individual differences within sections. The result is expressed in terms
of the F-ratio, in much the same way as the significance of differences
between two groups is expressed as t. (Indeed an F-ratio when only
two groups are compared is identical with t².) In this instance the
F-ratio is 2.93, and appropriate tables tell us that the probability of
such differences between seven means arising by chance is quite
small, approximately 0.01. We may conclude that there is a definite
tendency for better students to be put into some sections, poorer
students into others.
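
The remark that an F-ratio for two groups is identical with t² can be checked numerically. The sketch below is only an illustration: the two lists of marks are hypothetical, invented for the purpose, and the function names `pooled_t` and `f_ratio` are not taken from the text. It assumes the usual pooled-variance form of t and the usual between-groups over within-groups form of F.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's t for two independent groups, pooling the within-group variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss_within / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def f_ratio(groups):
    """One-way analysis of variance: between-groups mean square over within-groups mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical marks, invented for illustration only.
section_a = [72, 65, 80, 58, 77, 69]
section_b = [61, 70, 55, 66, 59]

t = pooled_t(section_a, section_b)
F = f_ratio([section_a, section_b])
print(round(t * t, 6), round(F, 6))  # the two printed values agree
```

The same `f_ratio` function accepts any number of groups, which is the sense in which analysis of variance extends the t-technique.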

S.E. and P.E. of a Correlation Coefficient.¹ — The correlation
between two sets of measures is likely to be very unreliable if the
number of persons is small, as was shown in Ex. 28 (p. 76). The S.E. or
P.E. of a correlation coefficient tells us how widely this coefficient
may be expected to deviate from the true value, i.e. the value which
would be obtained from an unlimited number of persons. The
formulae for these errors will be given in Chap. VI.

S.E. and P.E. of an Individual Test Score or Examination Mark . — 
If some pupils or students take a test or examination two or more 
times, or answer two or more supposedly parallel tests in the same 
subject, they will not obtain precisely identical scores on each 
occasion. Hence any score or mark derived from only a single test is 
to some extent unreliable. It may deviate more or less from the score 
which would be obtained with far more thorough testing. In other 
words, the single test only gives us a limited sample of the pupils’ or 
students’ capabilities. Here too, then, the conception of the S.E. has 
valuable applications. But the reliability of a score depends, not on 
the number of persons who take the test, nor on the wideness of 
dispersion of scores, but on the thoroughness of the test itself, or the 
number of items of which it is composed, and on the amount of 
variability in pupils’ answers to these items. Elementary methods of 
calculation are given in Chap. VII. A full treatment of the underlying 
theory may be found in Gulliksen’s (1950) book. 

Finally, when a test or examination is employed for predicting a
pupil’s success or lack of success during his subsequent educational
or vocational career, it is obvious that the estimate obtained from
the test score will deviate more or less from the truth. The reliability
of such predictions naturally depends mainly upon the adequacy or
validity of the test. It, too, can be expressed in terms of a S.E. or P.E.,
and the appropriate methods will be given below.

¹ It seems to be the convention in psychological literature to employ the S.E.
in the statistical treatment of averages and the like, and the P.E. in the treatment
of correlations and individual scores. There is no logical reason for the
convention, but we shall usually adhere to it in this book.


TESTING DIFFERENCES BETWEEN CATEGORIC 
VARIABLES BY CHI-SQUARED 

In educational and psychological research, there are many types 
of data which we may wish to analyse or compare, but which are not 
amenable to averaging or to the calculation of Standard Errors. For 
example, the colour of people’s eyes cannot readily be measured as 
a continuous variable, yet we can classify individuals into a number 
of more or less distinct groups or categories according to their eye- 
colour. Similarly, we may wish to study the numbers of pupils at 
different ages who go to the cinema 0, 1, 2 or more times per week, 
or the numbers who answer Yes, Doubtful or No to an item in a 
questionnaire. Or we may wish to compare how many boys and 
girls are given gradings of VG, G, Fair and Poor in an examination. 


Ex. 36. — A group of 80 male students stated anonymously that
their voting in a General Election was: 23 Conservative, 13 Liberal,
and 44 Labour. We shall take it that, in the country as a whole, the
proportions had been 45%, 10% and 45%. Are we justified in
concluding that the students were more left-wing than the average, or
might this difference arise just through chance errors of sampling?
Our approach is to assume that no real differences exist between our
group and a group of 80 typical voters (this is termed the ‘null
hypothesis’), and then to analyse the degree of unlikelihood of the
difference occurring by chance.

In the following table, F gives the obtained frequencies of students
in each group or ‘cell’. Fₑ represents the expected frequencies in an
equal number of typical voters (45% of 80 = 36, and so on).

                      F      Fₑ     F − Fₑ    (F − Fₑ)²    (F − Fₑ)²/Fₑ
    Conservative     23      36      −13         169          4.694
    Liberal          13       8      + 5          25          3.125
    Labour           44      36      + 8          64          1.778
                     80      80                          χ² = 9.597

We then calculate Chi-Squared:

$$\chi^2 = \sum \frac{(F - F_e)^2}{F_e}$$

Note that the F − Fₑ column must add up to zero. The Chi-Squared Tables,¹
given in most statistical textbooks, tell us that this value, 9.597, has
a P or probability of close to 0.01, i.e. that it might arise by chance
only about once in a hundred similar samples. Hence there is a
fairly strong presumption that students of the kind we have studied
are less conservative than the general body of voters. Had our sample
been double the size, chi-squared would likewise be doubled, and
the probability would have dropped to less than 1 in 1,000.
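
The arithmetic of Ex. 36 can be retraced in a few lines. This is a minimal sketch rather than a general routine: the function names are illustrative, and `p_value_df2` relies on the fact that for exactly 2 degrees of freedom the upper-tail probability of chi-squared is exp(−χ²/2), so it must not be used for other tables.

```python
import math

def chi_squared(observed, expected):
    """Sum of (F - Fe)^2 / Fe over all cells."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

def p_value_df2(chi2):
    """Upper-tail probability of chi-squared, valid for 2 degrees of freedom only."""
    return math.exp(-chi2 / 2)

observed = [23, 13, 44]                   # Conservative, Liberal, Labour
proportions = [0.45, 0.10, 0.45]          # proportions assumed for the country as a whole
n = sum(observed)
expected = [p * n for p in proportions]   # 36, 8 and 36 voters

chi2 = chi_squared(observed, expected)
p = p_value_df2(chi2)
p_doubled = p_value_df2(2 * chi2)         # same proportions in a sample of 160
print(round(chi2, 3), round(p, 4), p_doubled < 0.001)
```

The exact probability comes out a little under the tabled 0.01, and doubling chi-squared takes it well below 1 in 1,000, in agreement with the text.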


The same procedure is used for comparing two or more groups,
except that we now ask whether our samples show the same
distribution, not whether they conform to some predetermined typical
distribution.


Ex. 37. — Groups of 75 boys and 65 girls gave the following as
their numbers of visits to the cinema per week. Is there a statistically
reliable sex difference?

    No. of              F                    Fₑ            (F − Fₑ)²/Fₑ
    Visits     Boys   Girls   Total     Boys    Girls      Boys    Girls
      0          5     10      15       8.03     6.97      1.14    1.32
      1         33     32      65      34.82    30.18      0.10    0.11
      2         26     18      44      23.57    20.43      0.25    0.29
      3          7      4      11
      4          3      1       4
      5          1      —       1
    (3 to 5)    11      5      16       8.57     7.43      0.69    0.74
                75     65     140      74.99    65.01      χ² = 4.64
                                                            P = .20

If the null hypothesis were true, we would expect the same
proportion of boys as of girls in each line of the table; i.e. of the 15
children with 0 visits, 75/140 × 15 = 8.03 should be boys, and
65/140 × 15 = 6.97 should be girls. Each entry in the Fₑ columns is
similarly calculated. Note that these Fₑ’s must add up to the same totals as

¹ Such tables show the probability of chi-squared for different ‘degrees of
freedom’. Degrees of freedom here means the number of cells whose F’s must 
be known in order to complete the distribution, in this case 2. For the total 
number of students is fixed, and we need only to know the numbers voting for 
two of the parties in order to determine the F in the third cell. 



90 


THE MEASUREMENT OF ABILITIES 


the F’s both across and down the way. The last three rows are best
combined together, since there is a general rule that no value of Fₑ
in a chi-squared analysis should fall below 5, or preferably 10.

(F − Fₑ)²/Fₑ is calculated as before, but this time for both columns.

The obtained value of chi-squared is not significant statistically¹;
our differences between the sexes could occur by chance 1 in 5 times
in other similar samples.
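
The working of Ex. 37 can be sketched as follows. The function names are illustrative, and `p_value_df3` uses a closed form that holds only for 3 degrees of freedom. Computed directly from the frequencies, chi-squared comes to about 4.70 rather than the printed 4.64; the difference appears to lie in the last girls' cell, which works out nearer 0.79 than 0.74. The probability is still close to 0.20, so the conclusion is unaffected.

```python
import math

def expected_counts(table):
    """Expected cell frequencies from row and column totals under the null hypothesis."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(table):
    """Sum of (F - Fe)^2 / Fe over every cell of the table."""
    exp = expected_counts(table)
    return sum((f - fe) ** 2 / fe
               for row, erow in zip(table, exp)
               for f, fe in zip(row, erow))

def p_value_df3(chi2):
    """Upper-tail probability of chi-squared, valid for 3 degrees of freedom only."""
    return (math.erfc(math.sqrt(chi2 / 2))
            + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

# Rows: 0, 1, 2, and 3-or-more visits (last three lines combined); columns: boys, girls.
visits = [[5, 10], [33, 32], [26, 18], [11, 5]]

fe = expected_counts(visits)
chi2 = chi_squared(visits)
p = p_value_df3(chi2)
print([[round(x, 2) for x in row] for row in fe])
print(round(chi2, 2), round(p, 2))
```

Because the expected frequencies are built from the marginal totals, each column of `fe` necessarily adds up to the same total as the corresponding column of obtained frequencies.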


This particular problem might also have been studied by
comparing percentages, using the formula on p. 84. 49.3% of the boys and
35.4% of the girls go twice or more a week. The C.R. works out at
1.68, which has a P value of 0.10. Though this gives a slightly more
favourable result, it too provides no acceptable evidence of a reliable 
difference ; and the more detailed chi-squared analysis is to be pre- 
ferred. Another point brought out by this example is that the F’s in 
each cell should invariably refer to numbers of separate and indepen- 
dent individuals or items. A very common error among students is 
to say that 26 boys have 52 visits between them, and 18 girls 36 
visits, and so on, and then to apply chi-squared to the table of visits. 

Another useful application of chi-squared is in testing whether an 
obtained distribution conforms to the normal (or any other theo- 
retical) shape. We have seen that distributions of small numbers of 
measures may depart considerably from normality, and that by in- 
creasing N, the irregularities tend to become ironed out. Here also, 
therefore, the result obtained from a limited sample (namely an 
irregular distribution) can deviate considerably from the result which 
a much larger sample would yield (a regular distribution) purely 
through errors of sampling: and the unlikelihood of this deviation 
can be calculated by chi-squared. 

¹ Here the degrees of freedom are 3. The totals across and down the way are
fixed, thus we need to know only 3 cell entries to complete the table. In the
following example, there are 8 degrees. For though 11 pairs of categories are
compared, 3 quantities are fixed: the number of cases, the mean score, and the
standard deviation. For further explanation of this rather tricky concept of
degrees of freedom, see Fisher (1930).

Ex. 38. — The following figures are extracted from Ex. 20 (p. 50).
They show in the F column the actual frequencies of scores obtained
by students on the Cattell test, and in the Fₑ column the frequencies
which would be expected if the distribution of scores was a truly
normal one. The procedure is as before, the smallest frequencies, at
the head and foot of the columns, being grouped together.

      F       Fₑ     F − Fₑ    (F − Fₑ)²/Fₑ
      8*      6*      + 2         0.6667
     11       9       + 2         0.4444
     22      19       + 3         0.4737
     26      30       − 4         0.5333
     35      41       − 6         0.8780
     47      44       + 3         0.2045
     43      40       + 3         0.2250
     27      28       − 1         0.0356
     18      18         0         0.0000
     10       9       + 1         0.1111
      3*      6*      − 3         1.5000
    250     250              χ² = 5.0723

    * Grouped classes.

Chi-squared has a probability of 0.75. We might expect as great a
deviation, or greater, in three instances out of four by pure chance.
Thus the approximation to normality is quite satisfactory.
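
The probability quoted for Ex. 38 can be reproduced without tables. The sketch below is illustrative only: the function names are not from the text, the frequencies are the eleven grouped classes, and `p_value_even_df` uses a closed form that holds only when the number of degrees of freedom is even, as it is here (11 classes less 3 fixed quantities).

```python
import math

def chi_squared(observed, expected):
    """Sum of (F - Fe)^2 / Fe over all classes."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

def p_value_even_df(chi2, df):
    """Upper-tail probability of chi-squared when df is a positive even number."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = chi2 / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# Grouped frequencies: obtained (F) and expected under normality (Fe).
observed = [8, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
expected = [6, 9, 19, 30, 41, 44, 40, 28, 18, 9, 6]

chi2 = chi_squared(observed, expected)
df = len(observed) - 3      # number of cases, mean and S.D. are fixed
p = p_value_even_df(chi2, df)
print(round(chi2, 3), df, round(p, 2))
```

A probability of about three in four means that deviations at least this large would be the rule rather than the exception in samples drawn from a truly normal distribution.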
MEASUREMENT OF CORRELATION

A VERY wide range of problems in educational and other branches
of psychology depends on the study of relationships between 
variables. For example, examinations are set at the end of the pri- 
mary school career, partly in order to test the work that the children 
have done, but also partly because achievement in English and arith- 
metic is supposed to be related to, or predictive of, future achieve- 
ment in more advanced work. The main object of applying intelli- 
gence, or other aptitude, tests is to show probable ability in various 
fields which are related to the test performances. In many secondary 
schools the pupils are allotted to different classes for general work, 
for mathematics, and for foreign languages, since it is believed (with 
much justification) that abilities in these three types of study are not 
very closely related, that some may be good in general work, poor at 
mathematics, medium at languages, and so on. 

Several different methods, applicable to different kinds of data, 
are available for determining the extent of relationship between two 
variables. Almost all of them yield indices ranging from 0 (no 
correlation) to +1.0 (perfect correlation). By far the most frequently
used are the product-moment correlation coefficient and the rank- 
order correlation coefficient. We will point out later the limitations 
of these two methods, and mention briefly some of the alternatives. 

Ex. 39. — The following marks were obtained by a class of 36
children on examinations in history and geography. A glance will
show that there is some correspondence; that those who obtain
high marks in one tend to get high marks in the other, and that low
marks in one tend to go with low marks in the other; but that there
are many exceptions to this trend. The state of affairs may be seen
more clearly if the two sets of measures are plotted against one
another. Fig. 21 is called a scatter-diagram or scattergram. The
marks have both been grouped into classes of 2. History marks (X₁)
are shown on the vertical axis, geography marks (X₂) on the
horizontal axis. Each pupil’s position in the two exams is shown by a
tick. For example, Alan’s 16 and 12 are represented by a tick
opposite the 15-16 mark in history, and below the 11-12 mark in
geography. Four pupils obtained either 17 or 18 in history and
either 13 or 14 in geography, hence there are four ticks in this box
or cell.

                  X₁         X₂
    Name       History   Geography       X₁     X₂       X₁     X₂
    Alan          16         12          10      5        8      6
    Helen         19         15          14     18       20     12
    John          17         11           8      6       17     11
    ....          10         17           6      4       12      8
    ....          18         14          18     20        7      2
    ....          18         16          14      8       17     16
    ....          15          4          12      4       14     11
    ....          17         13          18     14       14      8
    ....          12         12          16     10       12      3
    ....          16          8           6      4       16      1
    ....          11         10          18     14       20     15
    ....          14          9          12     11       17      6


[Fig. 21: scattergram. Geography marks (X₂) run along the horizontal axis
in classes 1-2 to 19-20; history marks (X₁) run down the vertical axis in
classes 19-20 to 5-6. One tick is entered per pupil, and a dotted diagonal
line marks the general trend.]

Fig. 21. — Scattergram, comparing two sets of school marks.


It will be seen that the majority of ticks tend to be grouped along 
the dotted diagonal line, although there are some rather extreme 
exceptions; for instance, the pupil with marks of 16 and 1, and the 
pupil with marks of 10 and 17. Obviously the closer the correspon- 
dence between the two sets, the more closely would the ticks fall along 
the diagonal, and the fewer would be the exceptions whose ticks are 
far removed from it. Fig. 26 (p. 133) and Ex. 49 (p. 113) show the
scattergrams for a low correlation of +0.146 and a high correlation
of +0.799 (the numbers in each cell represent the numbers of ticks).
Only if each pupil got marks which exactly corresponded would the
correlation be perfect and the coefficient be +1.0. Note, however, that
it is not necessary for the marks on the two variables to be identical 
for a perfect correlation to be obtained, since these marks may be 
arranged on different scales. In Fig. 21 the geography marks are on 
a different scale from the history marks ; 10 on the one corresponds, 
or is relatively equivalent, to 14 on the other. Consider another in- 
stance : the weights and heights of a class of children would certainly 
yield a very high correlation, although they would be measured on 
entirely different scales — pounds and inches. It is the closeness of 
agreement of relative scores or measures which governs the size of 
the correlation. 

Very few correlations in educational or psychological
investigations approximate to +1.0. Usually there is only a general trend for
high scores on one variable to go with high scores on the other, a
trend which admits of many exceptions. Two tests of the same ability,
e.g. two similar arithmetic papers, may perhaps give a coefficient of
+0.90 or over. Tests or exams in different subjects, e.g. arithmetic
and English, or history and geography, are more likely to give
moderate correlation coefficients of +0.40 to +0.70.

Occasionally one may find negative or inverse correspondence, 
when high scores on one variable go with low scores on another 
variable. For instance, in an ordinary school class intelligence or 
scholastic achievement and age are often negatively related, since
the oldest pupils are rather dull, the youngest pupils very bright.
Negative correlations are expressed, just as positive ones, by indices
ranging from 0 to −1.0.

Calculation of Correlations by the Product-moment Method (sometimes
called the Pearson-Bravais Method). — The fundamental
formula for the correlation coefficient, r, is:

$$r = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2}$$

Here σ₁ and σ₂ are the S.D.s of the two distributions, x₁ represents
a score or measure expressed as a deviation from the mean of all the
X₁ measures, x₂ is a measure expressed as a deviation from the mean
of all the X₂ measures, x₁x₂ is the product of a person’s two measures,
and Σx₁x₂ is the sum of all these products.

Since the means are seldom likely to be whole numbers, the
computation of Σx₁x₂ would involve very troublesome arithmetic.
Hence, just as in calculating S.D.s, it is usual to resort to the device
of the arbitrary mean. Each of the quantities in the above formula
then requires a correction. If x₁ is a deviation from an arbitrary
mean, σ₁ is no longer

$$\sqrt{\frac{\sum x_1^2}{N}} \quad\text{but}\quad \sqrt{\frac{\sum x_1^2}{N} - \left(\frac{\sum x_1}{N}\right)^2}$$

(cf. p. 47). A corresponding alteration is made in σ₂. Σx₁x₂/N becomes

$$\frac{\sum x_1 x_2}{N} - \frac{\sum x_1}{N} \times \frac{\sum x_2}{N}.$$

Hence the full formula is:

$$r = \frac{\dfrac{\sum x_1 x_2}{N} - \dfrac{\sum x_1}{N} \times \dfrac{\sum x_2}{N}}{\sqrt{\dfrac{\sum x_1^2}{N} - \left(\dfrac{\sum x_1}{N}\right)^2}\;\sqrt{\dfrac{\sum x_2^2}{N} - \left(\dfrac{\sum x_2}{N}\right)^2}}$$

This same formula applies if the original scores are used, i.e. if X₁
and X₂ are substituted for x₁ and x₂. The only difference is that the
arbitrary mean is now zero, instead of being some whole number
chosen close to the true mean.

Statistical textbooks often provide modifications of this formula, 
or ‘patent’ correlation techniques which are claimed to be simpler. 
The present writer prefers the direct method given here, both because
it makes use of concepts such as the true and arbitrary mean, σ, and
the correction required when σ is calculated from an arbitrary mean
— concepts with which the student should already be familiar — and 
because errors in computation are fairly readily detected. Moreover, 
a coefficient accurate to three places of decimals can be obtained by 
this method if the multiplication and division, and the finding of 
squares and square roots, are done with a 10-inch slide-rule and/or 
four-figure logarithm tables. 


Ex. 40. — The formula will be applied first to the original table of
marks, given in Ex. 39, and later to the same marks grouped into
classes (as in Fig. 21). This first method should generally be employed
only when N is small, say less than 50, and when a high degree of
accuracy is needed.¹ In the following table, columns X₁ and X₂ list
the original marks. As arbitrary means, 14 and 10 are selected.

     X₁    X₂     x₁     x₂     x₁²    x₂²    x₁x₂ (+)   x₁x₂ (−)
     16    12     +2     +2       4      4      + 4
     19    15     +5     +5      25     25      +25
     17    11     +3     +1       9      1      + 3
     10    17     −4     +7      16     49                 −28
     18    14     +4     +4      16     16      +16
     18    16     +4     +6      16     36      +24
     15     4     +1     −6       1     36                 − 6
     17    13     +3     +3       9      9      + 9
     12    12     −2     +2       4      4                 − 4
     16     8     +2     −2       4      4                 − 4
     11    10     −3      0       9      0        0
     14     9      0     −1       0      1        0
     10     5     −4     −5      16     25      +20
     14    18      0     +8       0     64        0
      8     6     −6     −4      36     16      +24
      6     4     −8     −6      64     36      +48
     18    20     +4    +10      16    100      +40
     14     8      0     −2       0      4        0
     12     4     −2     −6       4     36      +12
     18    14     +4     +4      16     16      +16
     16    10     +2      0       4      0        0
      6     4     −8     −6      64     36      +48
     18    14     +4     +4      16     16      +16
     12    11     −2     +1       4      1                 − 2
      8     6     −6     −4      36     16      +24
     20    12     +6     +2      36      4      +12
     17    11     +3     +1       9      1      + 3
     12     8     −2     −2       4      4      + 4
      7     2     −7     −8      49     64      +56
     17    16     +3     +6       9     36      +18
     14    11      0     +1       0      1        0
     14     8      0     −2       0      4        0
     12     3     −2     −7       4     49      +14
     16     1     +2     −9       4     81                 −18
     20    15     +6     +5      36     25      +30
     17     6     +3     −4       9     16                 −12

     N = 36      +61    +72     549    836     +466        −74
                 −56    −74
               = + 5  = − 2                        = +392

¹ When an electric calculating machine is available, this method is generally
used with original scores (not an arbitrary mean), even if N is very large.
Columns x₁ and x₂ list the deviations of X₁ and X₂ from these arbitrary means.
They are summed to give Σx₁ and Σx₂.

$$\frac{\sum x_1}{N} = \frac{+5}{36} = +0.1389 \qquad \frac{\sum x_2}{N} = \frac{-2}{36} = -0.0555$$

Therefore the true means are 14.139 and 9.945. Later we shall need

$$\left(\frac{\sum x_1}{N}\right)^2, \quad \left(\frac{\sum x_2}{N}\right)^2, \quad\text{and}\quad \frac{\sum x_1}{N} \times \frac{\sum x_2}{N}.$$

These come to +0.019, +0.003, and −0.008 respectively. Note
that three places of decimals are sufficient for these corrections.
Special attention should be paid to the sign (+ or −) of
(Σx₁/N) × (Σx₂/N).

The two columns on p. 96, x₁² and x₂², give the deviations squared.
These are summed and averaged.

$$\frac{\sum x_1^2}{N} = \frac{549}{36} = 15.250 \qquad \frac{\sum x_2^2}{N} = \frac{836}{36} = 23.222$$

Hence

$$\sigma_1 = \sqrt{15.250 - 0.019} = 3.903$$

$$\sigma_2 = \sqrt{23.222 - 0.003} = 4.819$$

The last pair of columns gives the products of the deviations; the
+ and − products are separated merely for convenience in summing.
Special attention to the signs is needed. When x₁ and x₂ are both +
or both − the product is plus, when one is + the other − the
product is minus. Notice how the exceptional cases, such as the pupil
with 10 and 17, make large negative contributions to Σx₁x₂, and so
ultimately reduce the size of r. The pupils who get very high marks
on both variables, or very low marks on both, such as those with 6
and 4 or with 18 and 20, make large positive contributions to Σx₁x₂
and help to raise r.

$$\frac{\sum x_1 x_2}{N} = \frac{+392}{36} = 10.889.$$

The correction, (Σx₁/N) × (Σx₂/N), is subtracted from this, but as its sign
was −, it is in this case added.

$$10.889 - (-0.008) = 10.897$$

All the corrections indicated in the second formula have been
made, hence we can finally apply the first, or fundamental,
correlation formula.

$$r = \frac{\sum x_1 x_2}{N}\left(\frac{1}{\sigma_1 \sigma_2}\right) = \frac{10.897}{4.819 \times 3.903} = +0.579$$
Thus the coefficient shows moderate correspondence between the
two sets of marks, as we guessed from looking at the scattergram.
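
The whole computation of Ex. 40 can be retraced mechanically. The sketch below is an illustration, not the writer's own procedure: the function name `product_moment_r` and its parameters `a1` and `a2` (the arbitrary means) are chosen here for convenience. It applies the full formula, with its corrections, to the 36 pairs of marks in the worked table, and confirms that taking the arbitrary means as 14 and 10 or as zero makes no difference to r.

```python
import math

# Marks of the 36 pupils, as listed in the worked table for Ex. 40.
history = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14,
           10, 14, 8, 6, 18, 14, 12, 18, 16, 6, 18, 12,
           8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
geography = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9,
             5, 18, 6, 4, 20, 8, 4, 14, 10, 4, 14, 11,
             6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]

def product_moment_r(xs, ys, a1=0, a2=0):
    """Correlation from deviations about arbitrary means a1 and a2."""
    n = len(xs)
    x1 = [x - a1 for x in xs]
    x2 = [y - a2 for y in ys]
    m1 = sum(x1) / n                      # correction term for the first variable
    m2 = sum(x2) / n                      # correction term for the second variable
    mean_product = sum(p * q for p, q in zip(x1, x2)) / n - m1 * m2
    sd1 = math.sqrt(sum(p * p for p in x1) / n - m1 ** 2)
    sd2 = math.sqrt(sum(q * q for q in x2) / n - m2 ** 2)
    return mean_product / (sd1 * sd2)

r_arbitrary = product_moment_r(history, geography, a1=14, a2=10)
r_original = product_moment_r(history, geography)   # arbitrary means of zero
print(round(r_arbitrary, 3), round(r_original, 3))
```

Working from original scores gives larger intermediate sums, which is why the arbitrary mean is preferred for hand computation, but the coefficient is the same either way.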

Product-moment Correlation between Tabulated Measures. — When
the measures are grouped into classes, the procedure is essentially
identical, except that the computations are, of course, performed for
all the measures in any one class at one time. We are no longer
concerned with the deviations of single measures from their mean, and
their products, but with deviations of classes from the mean class,
and their products. At no time is it necessary to translate the results
back into terms of the original measures, by multiplying by c (the
size of the class).

Ex. 41. — The two sets of measures are first plotted in a
scattergram, which should contain, if possible, between 11 and 19 rows and
a similar number of columns. There is no need to have identical
numbers of rows and columns. In Fig. 21 (p. 93) the data are, as a
matter of fact, rather too coarsely grouped, since there are only 8
rows and 10 columns. Over-coarse grouping has the effect of reducing
slightly the size of r (a formula for correcting this may be found in
advanced statistical textbooks). By adding up the frequencies in each
row, and in each column, we get the two frequency distributions of
grouped measures. Fig. 21 is reproduced below with the frequency
distribution of X₁ measures (F₁) written at the right-hand side of the
rows, and the frequency distribution of X₂ measures (F₂) written at
the foot of the columns. The frequencies in both these distributions
should be checked by making sure that they add up to N.


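            x₂ =  −4   −3   −2   −1    0   +1   +2   +3   +4   +5      F₁
    x₁ = +3        .    .    .    .    .    1    .    2    .    .       3
         +2        .    .    1    .    .    2    4    2    .    1      10
         +1        1    1    .    1    1    1    .    .    .    .       5
          0        .    .    .    2    1    1    .    .    1    .       5
         −1        .    2    .    1    1    2    .    .    .    .       6
         −2        .    .    1    .    .    .    .    .    1    .       2
         −3        1    .    2    .    .    .    .    .    .    .       3
         −4        .    2    .    .    .    .    .    .    .    .       2
    F₂             2    5    4    4    3    7    4    4    2    1      36 = N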
An appropriate arbitrary mean is now chosen for the X₁ classes
and for the X₂ classes, and the classes above and below these means
are numbered +1, +2, . . . −1, −2, . . . etc. These numberings
are written at the left-hand side of the rows, and above the tops of
the columns. Standard deviations are now calculated in the usual
way.


    x₁ and x₂     F₁     F₂     F₁x₁     F₂x₂     F₁x₁²    F₂x₂²
       +5          —      1       —      + 5        —        25
       +4          —      2       —      + 8        —        32
       +3          3      4     + 9      +12        27       36
       +2         10      4     +20      + 8        40       16
       +1          5      7     + 5      + 7         5        7
                                +34      +40
        0          5      3       0        0         0        0
       −1          6      4     − 6      − 4         6        4
       −2          2      4     − 4      − 8         8       16
       −3          3      5     − 9      −15        27       45
       −4          2      2     − 8      − 8        32       32
                  36     36     −27      −35       145      213

$$\frac{\sum F_1 x_1}{N} = \frac{+7}{36} = +0.1944, \qquad \left(\frac{\sum F_1 x_1}{N}\right)^2 = 0.0378$$

$$\frac{\sum F_2 x_2}{N} = \frac{+5}{36} = +0.1389, \qquad \left(\frac{\sum F_2 x_2}{N}\right)^2 = 0.0193$$

The correction (ΣF₁x₁/N) × (ΣF₂x₂/N), which will be needed later in the
numerator of the correlation formula, is:

$$+0.1944 \times +0.1389 = +0.027$$

$$\frac{\sum F_1 x_1^2}{N} = \frac{145}{36} = 4.028, \qquad \sigma_1 = \sqrt{4.028 - 0.038} = 1.997$$

$$\frac{\sum F_2 x_2^2}{N} = \frac{213}{36} = 5.917, \qquad \sigma_2 = \sqrt{5.917 - 0.019} = 2.429$$

Note that these S.D.s are approximately one-half those obtained
in the previous computation of r (namely 3.903 and 4.819), the
reason being that we have grouped the original measures into classes
of 2. They are not exactly half because the grouping has produced
some slight distortion of the distribution.

Now to obtain the products, the frequency in each cell must be
multiplied by its x₁ value and by its x₂ value. For instance, the entry
2 in cell (x₁ = +3) (x₂ = +3) contributes 2 × 3 × 3 = +18 to
ΣFx₁x₂. The entry 1 in cell (x₁ = −2) (x₂ = +4) contributes
1 × 4 × −2 = −8.






     x₁x₂    Cells (x₁, x₂) corresponding       Frequencies       Total     Fx₁x₂
             to this value of x₁x₂              in these cells    F’s
       0     (+1, 0) (0, 0) (−1, 0)             1, 1, 1
             (0, −1) (0, +1) (0, +4)            2, 1, 1             7          0
      +1     (+1, +1) (−1, −1)                  1, 1                2        + 2
      +2     (+2, +1)                           2                   2        + 4
      +3     (+3, +1) (−1, −3)                  1, 2                3        + 9
      +4     (+2, +2) (−2, −2)                  4, 1                5        +20
      +6     (+2, +3) (−3, −2)                  2, 2                4        +24
      +9     (+3, +3)                           2                   2        +18
     +10     (+2, +5)                           1                   1        +10
     +12     (−3, −4) (−4, −3)                  1, 2                3        +36
                                                                            +123
      −1     (+1, −1) (−1, +1)                  1, 2                3        − 3
      −3     (+1, −3)                           1                   1        − 3
      −4     (+2, −2) (+1, −4)                  1, 1                2        − 8
      −8     (−2, +4)                           1                   1        − 8
                                                                            − 22
                                                             ΣFx₁x₂ =       +101


It is unnecessary, however, to consider each cell in turn. Instead
we take each possible value of x₁x₂ in turn, as shown in the
accompanying table. All the cells for which either x₁ = 0 or x₂ = 0
contribute zero to ΣFx₁x₂. It will be seen that the total frequency in such
cells is 7. The total frequency in the cells where x₁x₂ = +1 is 2,
hence +2 is entered in the last column.
As soon as the student becomes familiar with the method he will
find it unnecessary to write out the second, third, and fourth
columns of such a table. Only the first, and the last two are needed.
The simplest plan is to draw a cross through each of the cells for
which x₁x₂ = 0, and to count up the frequencies in these cells while
doing so, and then enter the total in a table thus:

     x₁x₂      F      Fx₁x₂
       0       7        0
      +1       2       +2

Next cross out the cells where x₁x₂ = +1, and enter in the table,
and so on. At the end, sum the F column, in order to make sure that
none of the entries in the cells have been omitted. The last column
is summed to give ΣFx₁x₂, and the rest of the working is as before.

$$\frac{\sum F x_1 x_2}{N} = \frac{+101}{36} = 2.806.$$

The correction, to be subtracted, is +0.027.

$$2.806 - 0.027 = 2.779, \qquad r = \frac{2.779}{2.429 \times 1.997} = +0.573$$

It will be seen that in spite of the distortion due to grouping the
scores into a rather coarse scattergram, the result is nearly identical
with that obtained from the ungrouped measures, namely +0.579.


There are, of course, numerous possibilities of error in working out 
product-moment correlations. All additions should be checked up 
and down ; multiplying, dividing, squaring, etc., may be done with 
logarithm tables and confirmed with a slide-rule. But a little practice 
will enable the student to predict r fairly closely from consideration 
of the scattergram, and so to judge whether or not the value he 
reaches is likely to be correct.

An Alternative Method of Calculating Correlation from the Scatter- 
gram . — We will describe one of the various alternative methods, 
which may be found simpler to use than the above, direct, method. 
It eliminates the collection of the contributions to ΣFx₁x₂ from each
cell, and substitutes the calculation of two additional S.D.s, which
we shall call σ₃ and σ₄. Actually, one of these is sufficient, but the
other acts as a valuable check. The formula for r is:

$$r = \frac{\sigma_3^2 - \sigma_1^2 - \sigma_2^2}{2\sigma_1\sigma_2} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_4^2}{2\sigma_1\sigma_2}$$

Here σ₁ and σ₂ are the S.D.s of the F₁ and F₂ distributions as before.
The other two distributions are obtained by counting the frequencies
along the diagonals.

Ex. 42. — Applying this to the scattergram in Ex. 41: for σ₃ we
start from the top right-hand corner and proceed to the bottom
left-hand, counting all the entries along the diagonal at right angles
to this line as we go. On the diagonal running from (x₁ = +4)
(x₂ = +3) to (x₁ = +2) (x₂ = +5) there is one entry. The next
diagonal runs from (x₁ = +3) (x₂ = +3) to (x₁ = +1)
(x₂ = +5); this takes in 2 entries. The next also takes in 2, the
next 1 + 4 + 1 = 6, and so on. The complete distribution, F₃, is
given in the following table. The corresponding distribution F₄,
along the diagonal from top left to bottom right, is also listed.
Arbitrary means are chosen and the S.D.s worked out in the usual
way.


  F₃    F₄     d     F₃d    F₄d    F₃d²   F₄d²

   1          +7       7             49
   2          +6      12             72
   2     1    +5      10      5      50     25
   6     2    +4      24      8      96     32
   2     0    +3       6      0      18      0
   2     4    +2       4      8       8     16
   2     6    +1       2      6       2      6
        (sum, positive d)   +65    +27
   5    10     0       0      0       0      0
   3     8    -1       3      8       3      8
   2     2    -2       4      4       8      8
   1     1    -3       3      3       9      9
   3     1    -4      12      4      48     16
   2     0    -5      10      0      50      0
   0     1    -6       0      6       0     36
   3          -7      21            147
        (sum, negative d)   -53    -25
        (sum of squares)                   560    156



$$ \frac{\sum F_3 d}{N} = \frac{65 - 53}{36} = 0.333, \qquad \frac{\sum F_3 d^2}{N} = \frac{560}{36} = 15.556 $$

$$ \sigma_3^2 = 15.556 - (0.333)^2 = 15.445 $$

Similarly σ₄² = 4.333 − (0.056)² = 4.330.

We already know σ₁² and σ₂² to be 3.990 and 5.898. Hence:

$$ r = \frac{15.445 - 3.990 - 5.898}{2 \times 1.997 \times 2.429} = +0.573 $$

The version using σ₄ instead of σ₃ gives the same result, which is
also identical with that obtained on p. 99. 
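
The arithmetic of Ex. 42 can be checked with a few lines of Python. This is only a sketch: the function name and the dictionary layout are illustrative assumptions, the frequencies are those tabulated for F₃ and F₄, and the two variances 3.990 and 5.898 are simply taken from the text.

```python
from math import sqrt

def variance_from_deviations(freq_by_dev):
    """Variance of a frequency distribution given as {deviation: frequency}."""
    n = sum(freq_by_dev.values())
    mean = sum(d * f for d, f in freq_by_dev.items()) / n
    mean_sq = sum(d * d * f for d, f in freq_by_dev.items()) / n
    return mean_sq - mean ** 2

# Diagonal distributions of Ex. 42 (d = deviation from the arbitrary mean).
F3 = {7: 1, 6: 2, 5: 2, 4: 6, 3: 2, 2: 2, 1: 2, 0: 5,
      -1: 3, -2: 2, -3: 1, -4: 3, -5: 2, -6: 0, -7: 3}
F4 = {5: 1, 4: 2, 3: 0, 2: 4, 1: 6, 0: 10,
      -1: 8, -2: 2, -3: 1, -4: 1, -5: 0, -6: 1}

var1, var2 = 3.990, 5.898            # variances of F1 and F2 quoted in the text
var3 = variance_from_deviations(F3)  # about 15.44
var4 = variance_from_deviations(F4)  # about 4.33

denominator = 2 * sqrt(var1 * var2)
r_from_sum = (var3 - var1 - var2) / denominator
r_from_difference = (var1 + var2 - var4) / denominator

print(round(r_from_sum, 3), round(r_from_difference, 3))   # 0.573 0.573
```

Working from either diagonal gives the same coefficient, which is why the second distribution serves as a check on the first.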
Rank-Order Correlation. — A different method for determining
correlations is available for data arranged in order, for example the
orders of merit of a school class on two examinations. When the
number of cases is greater than about 30, the figures involved become
troublesomely big, but with a small number the method is much
easier to use than the product-moment method. Often, therefore, sets
of scores are turned into rank orders and the correlation between
these orders is computed. The formula is as follows:

$$ \rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} $$

where d is the difference between each person’s two ranks. Such a
correlation is often indicated by the Greek letter ρ (rho), instead of
by r.


Ex. 43. — The second and third columns of the following table
give the marks of eleven pupils on tests X₁ and X₂. The next two
columns show the ranks, pupil A being 6th on X₁, 4th on X₂, and
so on. The d column gives the differences in rank, regardless of sign.
These are squared and summed.

                      Ranks
Pupil   X₁    X₂    X₁     X₂      d       d²

  A     65    75     6      4      2       4
  B     85    66     1      7      6      36
  C     64    58     7      9      2       4
  D     78    82     3      1      2       4
  E     60    58     8      9      1       1
  F     70    80     5      2½     2½      6.25
  G     83    80     2      2½     ½       0.25
  H     55    71    10      5      5      25
  I     50    58    11      9      2       4
  J     72    55     4     11      7      49
  K     59    68     9      6      3       9

                                 Σd² =   142.5

Substituting in the formula:

$$ \rho = 1 - \frac{6 \times 142.5}{11(121 - 1)} = 1 - 0.648 = +0.352 $$


Clearly the bigger the differences in rank positions, the bigger will
Σd² be, and the lower the value of ρ. If 6Σd² is greater than N(N² − 1)
the correlation will be negative.

Now there is no difficulty in turning scores into ranks when, as in
X₁ in the above example, every person gets a different score. But
when two (or more) persons tie with the same score, they then divide
the two (or more) ranks between them, as in X₂. Thus pupils F and
G must share 2nd and 3rd place, and are both called 2½, whilst the
next highest pupil, A, is called 4. Again, C, E, and I share the 8th,
9th, and 10th positions, and are all called 9, whilst J, the next highest,
is 11. Such sharing is undesirable in that it distorts, and generally
decreases, the size of the correlation. It follows, therefore, that the
rank-order method should not be applied when there are a great
many ties.
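
The ranking rule for ties and the rank-order formula can be tried out on the marks of Ex. 43. The following is a minimal sketch; the helper names are illustrative assumptions, and rank 1 is given to the highest mark, as in the example.

```python
def average_ranks(scores):
    """Rank scores from the highest (rank 1) downwards; tied scores share
    the mean of the ranks they occupy between them."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for value in set(scores):
        first = ordered.index(value) + 1           # best rank held by this score
        count = ordered.count(value)               # number of persons tied on it
        rank_of[value] = first + (count - 1) / 2   # mean of the shared ranks
    return [rank_of[s] for s in scores]

def rank_order_correlation(x, y):
    """Return (rho, sum of d squared), with rho = 1 - 6*sum(d^2) / (N(N^2 - 1))."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2
                 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

# Marks of pupils A to K in Ex. 43.
x1 = [65, 85, 64, 78, 60, 70, 83, 55, 50, 72, 59]
x2 = [75, 66, 58, 82, 58, 80, 80, 71, 58, 55, 68]

rho, sum_d2 = rank_order_correlation(x1, x2)
print(sum_d2, round(rho, 3))   # 142.5 0.352
```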

OTHER METHODS OF MEASURING CORRELATION 

The nature and the main uses of other methods will be indicated 
briefly, but they will not be described in detail. Full accounts of them 
may be found in more advanced textbooks. 

Spearman’s Foot-rule Method. — This, like the rank method, is
applicable to data arranged in rank order. It is perhaps even simpler
than the ordinary rank method, but the coefficients which it yields
range only from +1.0 to −0.5, not from +1.0 to −1.0, and they
often diverge rather widely from those obtained by the first two 
methods. A new and easy method, which possesses certain important 
statistical advantages over the rank method, has been described by 
Kendall (1948). 

Correlation Ratio. — This is denoted by the symbol η (eta), instead
of by r. We saw in Fig. 21 (p. 93), that the ticks in a scattergram tend 
to be grouped along a straight diagonal. This is referred to as ‘linear 
regression.’ When the regression is non-linear, that is, when the ticks 
tend to be grouped round a curve, a product-moment correlation 
fails to show the closeness of relationship, and the correlation ratio 
is used instead. For example, when tests of motor perseveration are 
compared with assessments of desirable character traits, it is some- 
times found that both the persons with high perseveration scores 
and those with low scores are weak in character, whereas those with 
medium scores get the best character assessments. This type of 
relationship is illustrated in the accompanying scattergram, Fig. 22. 

Fig. 22. — Scattergram of non-linear correlation (horizontal axis:
motor perseveration scores).

Coefficient of Contingency. — Ordinary correlation methods cannot
be applied to categoric variables such as those described on p. 88,
nor to grossly non-normal distributions. We can nevertheless analyse
the association between, say, hair and eye colour, or between religious
affiliation and socio-economic grade, etc., by an extension of the
chi-squared technique. The individuals are first classified into
convenient numbers of categories for each variable, and a
cross-classification table, similar to a scattergram, is drawn up. We
then assume that there is no relationship between the variables and
compute, on the basis of this hypothesis, how many individuals would
be expected to fall within each of the cells. If the actual frequencies
diverge from these expected ones to an extent which cannot possibly be
ascribed to chance errors of sampling, this provides evidence of a
relationship between the variables.

Ex. 44. — Records of the times spent on television viewing over
one week were kept by 293 grammar school pupils, who were also
classified according to the number of years their families had had
television sets. Does the following table indicate that pupils from
families which have owned sets for a long time view less than
relatively recent owners?

                               Length of Ownership

                         Less than    1 to 2     3 years
                          1 year      years     and over    Total

Hours of        8+           14         31          41        86
viewing         4-7+         14         49          66       129
in 1 week       0-3+          5         21          52        78

                Total        33        101         159       293
Note that there is no need to have the same number of categories
for both variables. Quite broad ones are generally adopted, so that
no value of Fₑ falls much below 10.

Now if there were no relation between the two variables, we
might expect the 86 pupils with 8+ hours viewing to be distributed,
as regards their ownership periods, in the proportions 33 : 101 : 159.
That is, the Fₑ’s for the first line of the table should be:

$$ 86 \times \frac{33}{293},\quad 86 \times \frac{101}{293},\quad \text{and}\quad 86 \times \frac{159}{293} \;=\; 9.7,\ 29.6,\ \text{and}\ 46.7. $$

The other entries in the Fₑ table are similarly calculated, and the
F − Fₑ table does suggest that a certain excess of short-owners view
for more hours than long-owners.


Table of Fₑ’s.

      9.7     29.6     46.7       86
     14.5     44.5     70.0      129
      8.8     26.9     42.3       78

     33      101      159        293

Table of (F − Fₑ).

     +4.3     +1.4     −5.7
     −0.5     +4.5     −4.0
     −3.8     −5.9     +9.7

Table of (F − Fₑ)²/Fₑ.

     1.91     0.07     0.70
     0.02     0.46     0.23
     1.64     1.29     2.32

Chi-squared = 8.55, and as there are 4 degrees of freedom (cf. footnote,
p. 89), this has a probability of 0.07. Such odds are not high, but they
are suggestive of a slight negative correlation between amount of
viewing and length of ownership.
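
The expected frequencies and chi-squared of Ex. 44 can be recomputed directly from the observed table. The sketch below works with unrounded expected frequencies, so it arrives at about 8.51 rather than the 8.55 obtained from the rounded tables. The closed-form tail probability it uses holds only for 4 degrees of freedom; it is a convenience assumed here and is not a method given in the text.

```python
from math import exp

observed = [
    [14, 31, 41],   # 8+ hours of viewing
    [14, 49, 66],   # 4-7+ hours
    [5, 21, 52],    # 0-3+ hours
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency if the variables were unrelated:
# row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_squared = sum(
    (f - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for f, fe in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 4 degrees of freedom the chance of a chi-squared at least
# this large is exp(-x/2) * (1 + x/2).
probability = exp(-chi_squared / 2) * (1 + chi_squared / 2)

print([round(fe, 1) for fe in expected[0]])   # [9.7, 29.6, 46.7]
print(round(chi_squared, 2), degrees_of_freedom, round(probability, 2))
```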


This chi-squared can be converted into what is called the
contingency coefficient, C. A further modification yields a coefficient
which is more closely comparable to a correlation, r_c.

$$ C = \sqrt{\frac{\chi^2}{N + \chi^2}} \qquad\qquad r_c = \frac{1}{k_1 k_2}\sqrt{\frac{\chi^2 - \text{d.f.}}{N + \chi^2 - \text{d.f.}}} $$

Here, d.f. refers to degrees of freedom, and k provides a correction
for small numbers of categories.

     No. of
   Categories        k

        2          0.798
        3          0.872
        4          0.923
        5          0.949
        6          0.964
        8          0.979
       10          0.986
In our example, the pupils are grouped into 3 categories on each
variable, and k₁ = k₂ = 0.872. Thus:

$$ C = \sqrt{\frac{8.55}{293 + 8.55}} = 0.168 \qquad\qquad r_c = \frac{1}{0.872^2}\sqrt{\frac{8.55 - 4}{301.55 - 4}} = 0.163. $$
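
Both coefficients are simple functions of chi-squared, N, the degrees of freedom and the k corrections. A minimal sketch follows; the function names are illustrative assumptions, and the inputs are the figures of Ex. 44.

```python
from math import sqrt

def contingency_coefficient(chi_squared, n):
    """C = sqrt(chi^2 / (N + chi^2))."""
    return sqrt(chi_squared / (n + chi_squared))

def corrected_contingency(chi_squared, n, degrees_of_freedom, k1, k2):
    """r_c = (1 / (k1 * k2)) * sqrt((chi^2 - d.f.) / (N + chi^2 - d.f.)).

    Only meaningful when chi-squared exceeds the degrees of freedom.
    """
    excess = chi_squared - degrees_of_freedom
    return sqrt(excess / (n + excess)) / (k1 * k2)

c = contingency_coefficient(8.55, 293)
r_c = corrected_contingency(8.55, 293, 4, 0.872, 0.872)
print(round(c, 3), round(r_c, 3))   # 0.168 0.163
```

Neither function looks at which cells produced the divergence, so rearranging the columns of the table leaves both values unchanged.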

Considerable caution is essential in interpreting contingency,
since strictly it is a measure of divergence rather than of relationship.
Suppose, for instance, that our original table of F’s had been:

      41       14       31        86
      66       14       49       129
      52        5       21        78

     159       33      101       293

C and r_c would be the same as before, although the relationship is
quite different; for now it is the middle-ownership group which
views most, and the short-ownership group which views least. Thus
contingency coefficients never have a negative sign even though, as
in our example, they happen to express a negative correlation.

Two by Two Tables. — Ex. 45. — Taking the data from Ex. 39,
suppose that we do not know the pupils’ actual marks but only the
categories ‘top half’ and ‘bottom half’ on the two examinations.¹
The following table may be drawn up:

                                      GEOGRAPHY

                               Top Half      Bottom Half
                              11 or over     10 or under

HISTORY   Top Half
          15 or over            13 = a          5 = b          18

          Bottom Half
          14 or under            5 = c         13 = d          18

                                 18             18             36

There are several ways of determining a correlation for such data.
Two that have often been applied are Yule’s coefficient of association
(Q), and his coefficient of colligation (ω).

¹ There is no necessity for the proportions to be equal, as they are in the
present instance. But coefficients are more reliable the nearer the split is to the
median.
$$ Q = \frac{ad - bc}{ad + bc} $$

where a, b, c, and d are the frequencies in the four cells. Substituting,
we find Q = +0.742. Neither of these coefficients is at all closely
comparable to a product-moment correlation, and contingency is
therefore somewhat preferable. For the same table, χ² = 7.111;
hence C = 0.406, and r_c = 0.598. The latter figure compares quite
well with the original product-moment r of +0.579.
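
The figures for the two by two table can be verified as follows. This is a sketch only: the shortcut expression for chi-squared in a fourfold table, N(ad − bc)² divided by the product of the four marginal totals, is the usual identity and is assumed here, since the text does not spell out how 7.111 was reached.

```python
def yules_q(a, b, c, d):
    """Yule's coefficient of association, Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

def chi_squared_2x2(a, b, c, d):
    """Chi-squared of a fourfold table from its cell frequencies
    (no continuity correction)."""
    n = a + b + c + d
    marginals = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / marginals

# Cells of Ex. 45: a = 13, b = 5, c = 5, d = 13.
q = yules_q(13, 5, 5, 13)
chi_squared = chi_squared_2x2(13, 5, 5, 13)
print(round(q, 3), round(chi_squared, 3))   # 0.742 7.111
```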

A still more suitable technique is known as tetrachoric correlation
(r_t). This cannot easily be calculated, but it can be read off very
quickly from specially prepared graphs (cf. Chesire et al., 1933).
Note, however, that r_t can be used only when the variables to be
compared (although both expressed in the form of two categories)
can be regarded as normally distributed. It therefore applies to our
examination marks, but could not be used for sex, or eye-colour,
and the like. In the present instance, r_t = +0.642. The discrepancy
between this and our r of +0.579 is probably due to the small
number of cases and the irregularities of their distributions.

Biserial r. — When one of the variables to be compared has been
accurately measured and has yielded a normal distribution, but the
other (though believed to be normally distributed) can only be
expressed in the form of two categories, biserial r may be used. For
instance, if we wish to compare intelligence-test scores with success
or failure in a scholarship examination, we may not know the
examination marks, but we do know the intelligence scores of
scholars and non-scholars.


Ex. 46. — Assume that the pupils of Ex. 39 are divided into two
groups according to their history marks, the top 17 and the bottom
19. The mean geography marks of these groups are 12.23 and 7.89
respectively.

$$ r_{bis} = \frac{M_1 - M_2}{\sigma} \times \frac{pq}{z} \qquad\text{or}\qquad \frac{M_1 - M}{\sigma} \times \frac{p}{z} $$

M₁ and M₂ are the means of the contrasted groups. In the alternative
formula, M is the mean of the total group. σ is the S.D. of the
total group, namely 4.819. p and q are the proportions falling into
the groups, 17/36 and 19/36, and z is the ordinate of the normal curve
corresponding to p (obtainable from statistical tables); in this
instance z = 0.3977.

$$ r_{bis} = \frac{12.23 - 7.89}{4.819} \times \frac{0.2490}{0.3977} = +0.563. $$

Though this result agrees well with our product-moment r, r_bis
tends to be somewhat unreliable and easily distorted by irregularities
in the distribution.
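
Instead of looking z up in tables, the ordinate can be obtained from Python's standard library. The sketch below assumes the first form of the biserial formula; the function name and argument order are illustrative. Because z and pq are not rounded, the result differs from the worked figure in the third decimal place.

```python
from statistics import NormalDist

def biserial_r(mean_upper, mean_lower, sd_total, n_upper, n_lower):
    """r_bis = ((M1 - M2) / sigma) * (p * q / z), where z is the ordinate of
    the unit normal curve at the point cutting off the proportion p."""
    p = n_upper / (n_upper + n_lower)
    q = 1 - p
    unit_normal = NormalDist()
    z = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean_upper - mean_lower) / sd_total * (p * q / z)

# Ex. 46: top 17 and bottom 19 pupils, S.D. of all 36 geography marks 4.819.
r_bis = biserial_r(12.23, 7.89, 4.819, 17, 19)
print(round(r_bis, 2))   # 0.56
```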


THE PROBABLE ERRORS OF CORRELATION COEFFICIENTS


A correlation coefficient is, like any other numerical psychological 
result, liable to vary on account of sampling errors, especially when 
the sample of persons, whose scores are inter-correlated, is of small 
size. The formula for the P.E. of a product-moment coefficient is: 


$$ \text{P.E.}_r = \frac{0.6745(1 - r^2)}{\sqrt{N}} $$

Applying this to our r of +0.579, where N = 36, we get P.E. =
±0.075. It is usual to write the P.E. after the coefficient, thus: r =
+0.579 ± 0.075. The P.E. of a rank-order coefficient is slightly
greater than that of a product-moment one, and is usually
calculated by the formula:

$$ \text{P.E.}_\rho = \frac{0.7063(1 - \rho^2)}{\sqrt{N}} $$

For the determination of the P.E.s or S.E.s of the other coefficients 
mentioned above, the reader should consult more advanced text- 
books, such as Kelley (1924) or Guilford (1936). 

The connotation of the P.E. of r or ρ is similar to that of an
average or difference (cf. Chap. V). Thus +0.579 ± 0.075 signifies
that if the same history and geography papers were set to much
larger numbers of similar children, and the marks inter-correlated,
the true value of r so obtained would be as likely as not to lie
between +0.654 and +0.504 (i.e. r + 1 × P.E. and r − 1 × P.E.),
but that there is a 1 in 20 probability that it might deviate as widely
as +0.804 or +0.354 or more (3 × P.E.).
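
The two probable-error formulae translate directly into code. A minimal sketch, with function names chosen here purely for illustration:

```python
from math import sqrt

def probable_error_r(r, n):
    """P.E. of a product-moment coefficient: 0.6745 (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r * r) / sqrt(n)

def probable_error_rho(rho, n):
    """P.E. of a rank-order coefficient: 0.7063 (1 - rho^2) / sqrt(N)."""
    return 0.7063 * (1 - rho * rho) / sqrt(n)

r, n = 0.579, 36
pe = probable_error_r(r, n)
print(round(pe, 3))                          # 0.075
print(round(r - pe, 3), round(r + pe, 3))    # 0.504 0.654
```

Because the same r and N are used, the rank-order probable error always comes out larger than the product-moment one by the ratio 0.7063 to 0.6745.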

Just as a difference between averages should be two or more times
its S.E. to be accepted as statistically significant, so a correlation
should be 3 times its P.E. before it is regarded as showing a real
relationship between the inter-correlated variables. A coefficient of,
say, +0.30 ± 0.10 is on the borderline of significance. For if the
true value of r was zero, the scores of a limited sample of persons
might yield +0.30 once in 44 times.

In our previous illustration +0.579 is much greater than 3 × P.E.
Yet, as we have seen, this value is far from trustworthy, the reason
being the small size of N. Most investigators consider a correlation
calculated from less than 25 cases to be almost worthless because
of its low reliability, and they prefer to have at least a hundred cases
before they base any important conclusions on the results of an
experiment involving correlations.

When numbers are small, it is better to calculate t instead of the
P.E.

$$ t = \frac{r\sqrt{N - 2}}{1 - r^2} $$

For r = 0.30 and N = 36, t = 1.922. Thus this coefficient barely
reaches the 0.05 confidence level.
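
A short check of this figure is sketched below. The value 1.922 quoted above corresponds to dividing by (1 − r²); the form of Student's t usually given for a correlation coefficient divides by the square root of (1 − r²) instead, and that second form is added here as an assumption for comparison, not as something stated in the text.

```python
from math import sqrt

r, n = 0.30, 36

# Expression matching the worked figure: (1 - r^2) in the denominator.
t_as_worked = r * sqrt(n - 2) / (1 - r * r)

# Usual form of Student's t for a correlation: sqrt(1 - r^2) in the
# denominator, referred to tables with N - 2 degrees of freedom.
t_usual = r * sqrt(n - 2) / sqrt(1 - r * r)

print(round(t_as_worked, 3), round(t_usual, 3))   # 1.922 1.834
```

The two forms differ little when r is small, but the gap widens quickly as r grows, so it matters which one a given set of tables assumes.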

Fisher’s z Method. — The P.E. method of indicating the reliability
of r is inaccurate, not only when N is small, but also when r is large,
exceeding say 0.40. Fisher (1930) presents a better method which is
generally adopted nowadays. It consists in transforming r into a new
quantity called z, whose true S.E. is easily calculated.¹ Applying
this method to our r = +0.579, we find that the limits within which
true r is as likely as not to lie are almost identical with the limits
derived, above, from P.E.r, but that the 0.05 limits should be +0.762
and +0.310 instead of +0.804 and +0.354.
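
The transformation is the inverse hyperbolic tangent, so the limits can be computed without tables. The sketch below assumes that the 0.05 limits are taken at 1.96 × S.E. on either side of z and that the even-chance limits are taken at 0.6745 × S.E.; the function name is illustrative.

```python
from math import atanh, sqrt, tanh

def fisher_limits(r, n, multiple):
    """Limits for r found by moving `multiple` standard errors either side
    of z = atanh(r), where S.E. of z = 1 / sqrt(N - 3), and converting back."""
    z = atanh(r)
    se = 1 / sqrt(n - 3)
    return tanh(z - multiple * se), tanh(z + multiple * se)

low_05, high_05 = fisher_limits(0.579, 36, 1.96)
low_pe, high_pe = fisher_limits(0.579, 36, 0.6745)

print(round(low_05, 2), round(high_05, 2))   # 0.31 0.76
print(round(low_pe, 2), round(high_pe, 2))   # 0.5 0.65
```

Unlike limits of the form r ± k × P.E., these limits are not symmetrical about r and can never pass beyond ±1.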

Combining and Comparing Correlations. — The z method also
provides a useful way of combining two or more correlations.

Ex. 47. — Suppose that another class of 43 pupils had taken the
history and geography papers and yielded a coefficient of +0.450
± 0.082, what is the best estimate of the true correlation? The
procedure is to determine z₁ and z₂ for the two r’s, and then to compute
the weighted average,

$$ \bar{z} = \frac{z_1(N_1 - 3) + z_2(N_2 - 3)}{N_1 + N_2 - 6} $$

Translating back from the average z gives us our answer, r =
+0.512 ± 0.057.
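
The weighted average of Ex. 47 can be sketched as follows; the function name and the list-of-pairs input are illustrative assumptions. Working with unrounded z values gives 0.511, within a unit of the last place of the tabled answer.

```python
from math import atanh, tanh

def combine_correlations(pairs):
    """pairs is a list of (r, N). Each r is turned into z = atanh(r), the z's
    are averaged with weights N - 3, and the mean z is turned back into r."""
    weighted_sum = sum(atanh(r) * (n - 3) for r, n in pairs)
    total_weight = sum(n - 3 for _, n in pairs)
    return tanh(weighted_sum / total_weight)

combined = combine_correlations([(0.579, 36), (0.450, 43)])
print(round(combined, 2))   # 0.51
```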

It is often required to find whether a correlation in one group is
appreciably different from that in another group, or whether the
difference can be ascribed to errors of sampling. When the numbers
of cases are large the stock method may be used,

$$ \sigma_d = \sqrt{\text{S.E.}_{r_1}^2 + \text{S.E.}_{r_2}^2}. $$

¹ z = ½[logₑ(1 + r) − logₑ(1 − r)]. The S.E. of z = 1/√(N − 3). Fisher
provides tables of z for various values of r.

Ex. 48. — What is the significance of the difference between
+0.579 ± 0.075 and +0.450 ± 0.082? The S.E.s of these
coefficients are their P.E.s divided by 0.6745, i.e. 0.1108 and 0.1216.
σ_d = 0.1645. The difference is 0.579 − 0.450 = 0.129, which is not as
large as its own σ_d. A difference as large as or larger than this would
occur 43% of times as a mere matter of chance.

As the numbers are small, Fisher’s z method should preferably be
used. This consists in computing z₁, z₂, and their S.E.s, and applying
the stock formula to these S.E.s. The difference in z’s is +0.1763
in the above example. S.E.z₁ = 0.1741, S.E.z₂ = 0.1581. σ_d =
√(0.1741² + 0.1581²) = 0.2352. Here the difference is 0.75 times its
σ_d, and it might occur 45% of times as a mere matter of chance.
Note that these methods do not apply to the difference between two
correlations in the same group of people (cf. Lindquist, 1940).
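
Both versions of the comparison in Ex. 48 are sketched below. The chance probabilities are taken as two-sided areas of the normal curve, an assumption that reproduces the 43% and 45% quoted; the helper name is illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist

def chance_probability(difference, se_difference):
    """Return (ratio of difference to its S.E., two-sided normal probability
    of a difference at least this large arising by chance)."""
    ratio = abs(difference) / se_difference
    return ratio, 2 * (1 - NormalDist().cdf(ratio))

r1, n1, r2, n2 = 0.579, 36, 0.450, 43

# Stock method: S.E. of r = (1 - r^2) / sqrt(N), i.e. its P.E. / 0.6745.
se_r1 = (1 - r1 ** 2) / sqrt(n1)
se_r2 = (1 - r2 ** 2) / sqrt(n2)
ratio_r, p_r = chance_probability(r1 - r2, sqrt(se_r1 ** 2 + se_r2 ** 2))

# Fisher's z method: S.E. of z = 1 / sqrt(N - 3).
se_z = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
ratio_z, p_z = chance_probability(atanh(r1) - atanh(r2), se_z)

print(round(ratio_r, 2), round(p_r, 2))   # 0.78 0.43
print(round(ratio_z, 2), round(p_z, 2))   # 0.75 0.45
```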



CHAPTER VII 


INTERPRETATION AND APPLICATIONS OF 
CORRELATIONS 

A CORRELATION coefficient is never an end in itself, but always a 
means towards the proper interpretation of numerical data. By 
studying the correlations between tests of various abilities it is 
possible to analyse scientifically the nature of such abilities. This is 
one important application, which we shall consider in Chap. VIII. 
But the primary value of a correlation is that it enables us to make 
predictions. If abilities X₁ and X₂ are known to be inter-correlated,
then we can use measures of X₁ to predict scores on X₂, and vice
versa. For instance a pupil’s probable scholastic success may be 
forecast from his performance on an intelligence test. This cannot, 
of course, be done with complete accuracy, but the correlation co- 
efficient will also tell us how accurate is the forecast, or what are its 
limits of error. 

Ex. 49. — The following scattergram shows the Intelligence
Quotients of 440 children and adults on the Stanford-Binet scale, and
their I.Q.s on the Vocabulary test from this scale. (The group or
category 140 includes all testees with I.Q.s around 140, i.e. ranging
from 135½ to 145½; similarly with the other groups.) Here the
correlation is quite a high one, namely +0.799 ± 0.012. Hence we
would be justified in testing people only with the Vocabulary test,
and predicting from their results what would be their Binet I.Q.s.
Let us see how to make such predictions, and how good they
would be.

Suppose a child to obtain a Vocabulary I.Q. of 120, his Binet I.Q.
may, according to the scattergram, lie as high as the 140 class, i.e.
up to 145, or as low as the 100 class, i.e. down to 96. But presumably
its most probable value will be the average of the column:


     X₁       F

    140       3
    130      12
    120      16
    110       9
    100       8

This works out at 118.5. Similarly for a child whose Vocabulary I.Q.
is 70, the most likely value of his Binet I.Q. is 75.0, though it may
lie anywhere between 56 and 95.

                              X₂ = Vocabulary I.Q.
X₁ = Binet
I.Q.          60    70    80    90   100   110   120   130   140   150   Total   Average

140                                          2     3     1     2     4      12     130.1
130                                    1     6    12     5     2     1      27     121.5
120                        1     1     5    15    16     9     1            48     115.6
110                        2     9    13    33     9     4     1            71     107.6
100                        8    31    36    17     8                       100      99.3
90                   3    19    45    22     7                              96      91.2
80             2    11    17    21     2     1                              54      82.4
70             7     8     4     5     1                                    25      74.0
60             3     4                                                       7      65.7

Total         12    26    51   112    80    81    48    19     6     5     440
Average     69.2  75.0  88.0  91.9  99.9 109.9 118.5 121.6 128.3 138.0

Along the bottom of the scattergram are given the averages of the
columns, i.e. of the X₁ measures which most probably correspond
to the X₂ measures at the heads of the columns. Fig. 23 is a graph
of these figures. It will be seen that the points on the graph lie
approximately on a straight line which runs through X₁ = 100,
X₂ = 100. That they do not all lie exactly on the line is due merely
to irregularities in the data. For though the numbers involved in our
example are large, they are still not large enough to yield really
smooth normal distributions, or regularly spaced averages.
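
The column averages can be recomputed from the cell frequencies. The sketch below holds the scattergram of Ex. 49 as a dictionary keyed by Binet class. The placing of each frequency in its column is a reading of the printed table and should be treated as an assumption; it does reproduce the row totals and the grand total of 440.

```python
vocabulary_classes = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]   # X2

# Keys are Binet I.Q. classes (X1); each list runs across the X2 classes.
scattergram = {
    140: [0, 0, 0, 0, 0, 2, 3, 1, 2, 4],
    130: [0, 0, 0, 0, 1, 6, 12, 5, 2, 1],
    120: [0, 0, 1, 1, 5, 15, 16, 9, 1, 0],
    110: [0, 0, 2, 9, 13, 33, 9, 4, 1, 0],
    100: [0, 0, 8, 31, 36, 17, 8, 0, 0, 0],
    90:  [0, 3, 19, 45, 22, 7, 0, 0, 0, 0],
    80:  [2, 11, 17, 21, 2, 1, 0, 0, 0, 0],
    70:  [7, 8, 4, 5, 1, 0, 0, 0, 0, 0],
    60:  [3, 4, 0, 0, 0, 0, 0, 0, 0, 0],
}

def column_means(table, n_columns):
    """Average X1 class of the cases falling in each X2 column."""
    means = []
    for j in range(n_columns):
        cases = sum(row[j] for row in table.values())
        weighted = sum(x1 * row[j] for x1, row in table.items())
        means.append(weighted / cases)
    return means

means = column_means(scattergram, len(vocabulary_classes))
total_cases = sum(sum(row) for row in scattergram.values())

for x2, mean_x1 in zip(vocabulary_classes, means):
    print(x2, round(mean_x1, 1))
```

Plotting each column mean against its Vocabulary class gives the points of the regression graph described in the text.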

Now as we approach the extreme values of X₂, the corresponding
values of X₁ tend to lag behind; 118.5 is smaller than 120, 138 is
much smaller than 150. Similarly with values below 100, 75 is not so
low as 70, 69 is much less low than 60. This is an important and
characteristic feature of all correlations. It is known as the regression
of X₁ towards the mean. The line drawn through the points on the
graph is called a regression line, or a graph showing the regression
of X₁ on X₂.

Fig. 23. — Graph showing the most probable values of X₁ for testees
who obtain each value of X₂.

Suppose now that the correlation is a perfect one. The scores on
X₂ and X₁ would be identical, and the slope of the regression line
would be 45°. Next suppose that there is no correlation at all
between X₁ and X₂. It would then obviously be impossible to predict
scores on one variable from those on the other. If we took successive
columns of the scattergram and averaged them, we should find that
all these means lay close to the mean of X₁. Whatever the value of
the Vocabulary I.Q., the average Binet I.Q. would be round about
100. In other words, the regression line would be horizontal. From
these two hypothetical cases we can now see that the nearer the slope
of the regression line is to 45°, the higher the correlation; the nearer
it is to horizontal, the lower the correlation. Furthermore, when the
correlation is perfect, there is no regression towards the mean, since
each X₁ value is as big as each X₂ value. But when the correlation is
zero, the regression is complete, since whatever the value of X₂, the
value of X₁ is always the mean. Thus the greater the regression, the
nearer the correlation approaches zero.

It has been assumed in the previous paragraph that the X₁ and X₂
measures are expressed in equivalent units. If the dispersions of the
two distributions are very different, then the slope of the regression
line corresponding to perfect correlation would not, of course, be
45°. However, as shown in Chap. IV, measures can be turned into
equivalent units if they are expressed as deviations from their means
and divided by their standard deviations, i.e. as x₁/σ₁ and x₂/σ₂. If
this is done the slope of the regression line is identical with r. When
the slope is 45°, (x₁/σ₁)/(x₂/σ₂) = 1.0, that is perfect correlation.
When the slope is horizontal, (x₁/σ₁)/(x₂/σ₂) = 0.0, that is zero
correlation.

The slope of the regression line plotted in Fig. 23 is 0.80. σ₁ and
σ₂ are 17.49 and 17.84 (i.e. the Vocabulary I.Q.s are slightly more
spread out than the Binet I.Q.s). Hence r = 0.80 × 17.84/17.49 ≈ 0.815.


This is very close to the figure already quoted for the correlation, as 
determined by the product-moment method. As a matter of histor- 
ical fact, r was originally determined by plotting the regression line 
and measuring its slope, before statisticians invented the product- 
moment method. The latter method is always employed now because, 
as we have seen, the regression line obtained by averaging the 
columns is somewhat inexact, even when N is large. It is more useful
to reverse the historical procedure, to calculate σ₁, σ₂, and r, and
from them to determine the regression line. If (x₁/σ₁)/(x₂/σ₂) = r,
then x₁ = r x₂ σ₁/σ₂. This is the algebraic equation of the regression
line, and it can be used for predicting any individual’s most probable
X₁ score from his X₂ score. Alternatively, if we wish to predict in
terms of original measures, instead of measures expressed as
deviations from their means, the formula becomes:

$$ (X_1 - M_1) = r\,(X_2 - M_2)\,\frac{\sigma_1}{\sigma_2}. $$

Ex. 50. — What is the most probable Binet I.Q. corresponding to
a Vocabulary I.Q. of 140? According to the averages listed at the
foot of the scattergram, the answer is 128.3. But this is based on only
6 cases and is therefore very unreliable. Applying the formula:

$$ X_1 - 100 = 0.799\,(140 - 100) \times \frac{17.49}{17.84} $$

X₁ = 131.3 is the correct answer.
Regression of X₂ on X₁. — So far we have dealt only with the
prediction of X₁, knowing X₂. The opposite is equally possible. On the
right-hand side of the scattergram in Ex. 49 are listed the average
values of X₂ which constitute the best estimates of the Vocabulary
I.Q.s of persons with various Binet I.Q.s. Here, too, there is regression
towards the mean. The probable Vocabulary I.Q. corresponding to a
Binet I.Q. of 120 is 115.6, and for a Binet I.Q. of 70 it is 74.0. A
second graph may therefore be drawn, showing the regression of
X₂ on X₁. Fig. 24 shows the two lines diagrammatically. If the
correlation was perfect the two lines would coincide, both having a
slope of 45° (provided, always, that X₁ and X₂ are measured in
equivalent units). If the correlation was zero, the second line would
be vertical, as in Fig. 25, since the only possible estimate of
Vocabulary I.Q. from Binet I.Q. would be 100. With an intermediate
correlation, such as our +0.799, the second line makes the same
slope with the vertical as does the first regression line with the
horizontal; and the two cross at the means of X₁ and X₂. Thus the
algebraic equation of the second line, which is also the formula for
predicting X₂ from X₁, will be: x₂ = r x₁ σ₂/σ₁, or, in terms of scores,

$$ (X_2 - M_2) = r\,(X_1 - M_1)\,\frac{\sigma_2}{\sigma_1}. $$

Fig. 24. — Regression lines when r = +0.799 (the two lines are
labelled ‘Regression of X₂ on X₁’ and ‘Regression of X₁ on X₂’).

Fig. 25. — Regression lines when r = 0.0.

Ex. 51. — What is the probable Vocabulary I.Q. of a testee with
Binet I.Q. 80?

$$ X_2 = 100 + 0.799\,(80 - 100) \times \frac{17.84}{17.49} = 83.7. $$

The estimate based on the scattergram was 82.4, which agrees
quite closely with our answer.
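
One function serves for both regression equations, since predicting X₂ from X₁ only swaps the roles of the two variables. A minimal sketch, with illustrative names and with both means taken as 100 as in Exs. 50 and 51:

```python
def predict(score, mean_given, mean_wanted, sd_given, sd_wanted, r):
    """Most probable score on the wanted variable for a known score on the
    given one: wanted mean + r * (score - given mean) * sd_wanted / sd_given."""
    return mean_wanted + r * (score - mean_given) * sd_wanted / sd_given

R = 0.799
SD_BINET, SD_VOCABULARY = 17.49, 17.84

# Ex. 50: Binet I.Q. expected for a Vocabulary I.Q. of 140.
binet = predict(140, 100, 100, SD_VOCABULARY, SD_BINET, R)

# Ex. 51: Vocabulary I.Q. expected for a Binet I.Q. of 80.
vocabulary = predict(80, 100, 100, SD_BINET, SD_VOCABULARY, R)

print(round(binet, 1), round(vocabulary, 1))   # 131.3 83.7
```

With r = 1 and equal S.D.s the function returns the given score unchanged, and with r = 0 it always returns the mean, which are the two limiting cases discussed earlier.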

P.E. of Estimated Scores. — The regression equations will tell us
the most probable estimate of X₁ from X₂, or of X₂ from X₁. But it
is obvious from the scattergram that the actual X₁ may fall within
quite a wide range on either side of the estimated X₁, and that
similarly an estimated X₂ is only the average of a wide range of
possible X₂’s. The ranges may be determined from the formulae:

$$ \text{P.E.}_{estim. X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2} $$

$$ \text{P.E.}_{estim. X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2} $$

Applied to Exs. 50 and 51: P.E.estim.X₁ = 0.6745 × 17.49 × √(1 − 0.799²)
= 7.1, P.E.estim.X₂ = 7.3.

The P.E. or S.E. of an estimated score has a similar meaning to
those already considered, as may be shown by the following example.
Consider a very large group of persons whose Binet I.Q.s are 110,
and whose estimated Vocabulary I.Q.s are 108.1 ± 7.3. Then one
half of them should have Vocabulary I.Q.s lying between 115.4 and
100.8, and all but 1 in 100 should lie within a range of 2.58 × S.E.,
i.e. about 4 × P.E., or between 136 and 80.

Actually we already know the Vocabulary I.Q.s of 71 persons 
with Binet I.Q.s round 110. This is not a very large group, but it 
should serve to test out our prediction. The scattergram is too coarse, 
but looking back to the original scores it appears that 37 of them, 
or 52%, had Vocabulary I.Q.s between 115 and 101 (inclusive), and 
that the two extreme Vocabulary I.Q.s were 138 and 80. This agree- 
ment is quite close. 

It is important to note that the P.E. of an estimated score de- 
creases as r increases. If the correlation were perfect there would be 
no error at all in estimating X₁ from X₂, or X₂ from X₁. The lower
the correlation, the less accurate will be the predictions. 
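
How the probable error of an estimate shrinks as r rises can be seen from a short sketch. The function name is an illustrative assumption; the S.D.s are those of the Binet and Vocabulary I.Q.s. Without intermediate rounding the Vocabulary figure comes to a little over 7.2, slightly below the 7.3 used in the worked example.

```python
from math import sqrt

def probable_error_of_estimate(sd, r):
    """P.E. of a score estimated by regression: 0.6745 * sd * sqrt(1 - r^2)."""
    return 0.6745 * sd * sqrt(1 - r * r)

pe_binet = probable_error_of_estimate(17.49, 0.799)
pe_vocabulary = probable_error_of_estimate(17.84, 0.799)
print(round(pe_binet, 1))   # 7.1

# The error falls from 0.6745 * sd at r = 0 to nothing at r = 1.
for r in (0.0, 0.5, 0.8, 0.95, 1.0):
    print(r, round(probable_error_of_estimate(17.49, r), 2))
```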

Ex. 52. — As an additional illustration, take the results from the
36 history and geography papers. Here N is far too small for accurate
regression lines to be obtained by averaging rows or columns. But
by the use of the formulae given above, probable geography marks
can be estimated from history marks, or vice versa. For instance:
what geography mark (X₂) corresponds to a history mark (X₁) of 7,
and what is its P.E.?

r = +0.579

$$ (X_2 - 9.945) = 0.579\,(7 - 14.139) \times \frac{\sigma_2}{\sigma_1} $$

$$ X_2 = 9.945 - 5.103 = 4.84 $$

$$ \text{P.E.}_{estim. X_2} = 0.6745 \times 4.819\sqrt{1 - 0.579^2} = 2.65 $$

Therefore X₂ = 4.84 ± 2.65.

Here the correlation is only moderate, hence the estimate is poor
in reliability. All we can claim is that a pupil with a history mark of
7 would be very unlikely to get a geography mark as high as 15
(X₂ + 4 × P.E.), and would most probably get round about 2 to 7.

INTERPRETATION OF CORRELATIONS 

In the light of the above discussion it is possible to give a much
more precise meaning to the conception of correlation than
heretofore. A correlation does not mean, as is sometimes supposed, the
percentage agreement between the correlated variables. In the
scattergram on p. 113, for example, only 37½% of the testees get the
same I.Q., or rather the same class of I.Q., on Binet and Vocabulary,
although the correlation is +0.799. It is possible to interpret a
coefficient in terms of the number of factors or elements which are
common to the two variables, which therefore cause them to
correlate with one another, but there is little practical point in so doing.
What a correlation does indicate is the amount of reduction of error
in predicting scores on one variable from scores on the other variable.
When r is zero, the error is at its maximum, namely 0.6745σ.
Predictions based on X₁ might fall anywhere within the whole range of
X₂ scores, and would be no more accurate than drawing X₂ scores
out of a hat. With a correlation of 1.0 the error is reduced to zero.
The extent of reduction of error for intermediate r’s is shown in the
following table. The second column lists values of 100(1 − √(1 − r²)),
and so represents the reduction in percentage terms. These figures
are sometimes referred to as the forecasting efficiencies of the r’s. It
will be seen that the forecasting efficiency only becomes high enough
to be of great practical value when r is in the neighbourhood of 0.8
to 0.9 or more. Great caution should be exercised therefore before
predicting, say, success at a job from aptitude tests, since these
usually give correlations of only +0.4 to +0.6 with what they are
intended to predict. Such deductions are only 8% to 20% more
accurate than pure chance guesses.


     r       Reduction of error, or
             forecasting efficiency %

   0.00              0.0
   0.10              0.5
   0.20              2.0
   0.30              4.6
   0.40              8.4
   0.50             13.4
   0.60             20.0
   0.70             28.6
   0.80             40.0
   0.90             56.4
   0.95             68.8
   1.00            100.0

An alternative method of interpretation is suggested by McClelland 
(1942). Suppose we are selecting the top 20 of a group of 100 can- 
didates, either for secondary schooling or for a job, on the basis of 
their intelligence test or selection battery scores. We know that the 
test gives a correlation of r with later success. What proportion of 
candidates is correctly allocated, or how many do we wrongly select 
and wrongly reject? The position for various levels of r is shown in 
the following tables. 


Success- Unsuc- 
ful later cessful 


Pass test 

20 

0 

’ 20 

Fail test 

0 

80 

; 80 

1 

20 

80 

1 100 


r = 

100 



4 

16 ! 

20 

14 

6 ; 

20 

16 

64 ; 

80 

6 

74 : 

80 

20 

80 i 

100 

20 

80 ; 

100 

r 

= 0-00 


r = 

= 0-85 



In the first table all the best 20 on the test turn out well and the 
bottom 80 badly ; correlation is perfect. In the second, r is zero and 
our selection is no better than picking names out of a hat, since as 
large a proportion of the 20 who pass do well later as of the 80 who 
fail the test (namely 20% of each). Nevertheless, 64 and 4 are 
correctly allocated, and the proportion of errors is 32%. The third 
situation represents the typical situation in secondary school
selection, where the battery of intelligence and attainments tests
correlates about 0.85 with later school work. The number of correct
allocations is now 88, and of errors 12%. (Note, however, that nearly
one-third of the 20 Passes who actually get into the grammar school
gain unsatisfactory marks later.) Thus we have reduced errors by
$(32 - 12)/32 = 62\tfrac{1}{2}\%$. Here are the corresponding reductions for other
levels of r.


    r       Reduction of errors
            in prediction %

   0.00              0
   0.10              6
   0.20             12
   0.30             17
   0.40             23
   0.50             30
   0.60             37
   0.70             46
   0.80             57
   0.90             70
   0.95             80
   1.00            100
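
The reduction quoted for r = 0.85 follows directly from the fourfold selection tables for r = 0.00 and r = 0.85. The sketch below is illustrative only, and its function names are assumptions introduced here. It counts the wrongly selected and wrongly rejected candidates in each table and expresses the improvement over chance as a percentage. It does not derive the other rows of the table, which depend on the shape of the correlation surface and are taken as given.

```python
def error_percentage(pass_good, pass_bad, fail_good, fail_bad):
    """Percentage of candidates wrongly selected or wrongly rejected."""
    total = pass_good + pass_bad + fail_good + fail_bad
    return 100.0 * (pass_bad + fail_good) / total


def reduction_of_errors(chance_errors, actual_errors):
    """Percentage by which errors fall below the chance (r = 0) level."""
    return 100.0 * (chance_errors - actual_errors) / chance_errors


chance = error_percentage(4, 16, 16, 64)   # table for r = 0.00
actual = error_percentage(14, 6, 6, 74)    # table for r = 0.85
print(chance, actual, reduction_of_errors(chance, actual))  # prints 32.0 12.0 62.5
```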


The Influence upon Correlations of Irrelevant Common Factors . — 
A positive correlation which is statistically significant (i.e. 3 or 
preferably times its P.E.) always indicates some kind of causal 
connection, or some common factor or factors running through the
two variables, even when the connection is not strong enough for us
to be able to forecast one variable from the other. Considerable care 
is needed, however, in interpreting such a connection, for its real 
nature may be quite unsuspected, or the common factors may be of 
an entirely irrelevant kind. Suppose, for example, that we correlated 
the size of boots worn by all the children in a school with the speed 
of their handwriting, we should undoubtedly obtain a moderately 
high coefficient. Yet it would be absurd to say that big boots cause 
quick handwriting, or vice versa. The real reason for the correlation 
is an irrelevant common factor, in this case age. The older children 
on the whole write more quickly and have larger feet than the 
younger ones. If we took children all of the same age and compared 
boots with handwriting, we should be likely to find that the corre- 
lation had fallen to zero. 

Consider another, rather more complex, illustration. A positive
correlation exists between backwardness in school work and un- 
employment among the children’s parents. Idealists may at once 
jump to the conclusion that unemployment brings about malnutri- 
tion and anxiety among the children, and that these make their work 
fall off. But here also there is an important common element which 
goes some way to explain the connection. Unemployed parents are 
known to be on the whole less intelligent than employed ones, and 
unintelligent parents tend not only to have unintelligent children, 
but also to give them less educational encouragement than do in- 
telligent parents. These factors — low intelligence and generally un- 
favourable environment — are far more important causes of back- 
wardness than is malnutrition. We need to take two groups of chil- 
dren, one with parents employed, one unemployed, groups whose 
average intelligence level is identical, and then study their school 
work to see whether the former do better than the latter. Only when 
we have ‘held the intelligence factor constant’ in this example, or 
‘held the age factor constant’ in the previous example, can we 
legitimately deduce some causal connection. 

Partial Correlation. — The above type of problem or difficulty in
interpretation is constantly occurring in psychology and education.
Sometimes it can be met by applying what is known as the partial
correlation technique. For this it is necessary to obtain measures of
the suspected irrelevant factor, which we shall call $X_3$, and to work
out the correlations between $X_1$ and $X_3$ ($r_{13}$), and between $X_2$ and
$X_3$ ($r_{23}$), as well as between $X_1$ and $X_2$ ($r_{12}$).¹ We may then ‘hold $X_3$
constant’ or ‘partial it out’ by the following formula, in which $r_{12.3}$
means the correlation of $X_1$ with $X_2$ when $X_3$ is eliminated:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

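Ex. 53. — Suppose the correlation between boots and handwriting
speed to be + 0.45, that between boots and age + 0.70, and between
handwriting and age + 0.60. Then the correlation between boots
and handwriting when the influence of age is removed is:

$$r_{12.3} = \frac{0.45 - 0.70 \times 0.60}{\sqrt{1 - 0.70^2}\,\sqrt{1 - 0.60^2}} = \frac{0.03}{0.571} = +\,0.05 \text{ (approximately)}$$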
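
The arithmetic of this example can be checked with a short Python sketch. The function name `partial_correlation` and its argument names are illustrative assumptions, not part of the text. It applies the partial correlation formula, with variable 3 (here age) held constant.

```python
import math


def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """Correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))


# Boots and handwriting (0.45), boots and age (0.70), handwriting and age (0.60).
r_boots_writing_given_age = partial_correlation(0.45, 0.70, 0.60)
print(round(r_boots_writing_given_age, 2))
```

The result is close to zero: once age is held constant, almost nothing remains of the apparent connection between boots and handwriting.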

¹ Thouless (1939a) has pointed out that the partial correlation formula can
only legitimately be used if $X_3$ can be measured without error. It is thus most
appropriate for holding age constant, but not for, say, intelligence, since
measurements of the latter always involve some error. However, Thouless
provides an alternative formula which may be used when the reliability of $X_3$ is
known.




The Effect of Homogeneity and Heterogeneity upon Correlations. — 
We have just seen that when pupils are heterogeneous as regards age, 
the correlation between a pair of their characteristics may be 
spuriously raised. The same is true of other types of heterogeneity. 
Suppose we calculate the correlation between two tests in an ordinary 
school class, and then eliminate from consideration most of the 
pupils of medium ability on both tests, and re-calculate the correla- 
tion, the second figure will certainly be larger than the first. The 
distributions will not, of course, be normal, the diversity or hetero- 
geneity of the pupils will be unduly large and this will increase the 
size of r. In ordinary practice the reverse, namely undue homogeneity, 
is more likely to occur without the investigator realizing it. Thus the 
correlations between intelligence tests and school work in an ordinary 
school class are usually lower than they would be in an unselected 
group of children of the same age. For such a school class is more 
homogeneous as regards intelligence and school work than is the 
population as a whole. The dullest children of that age are likely to 
be at a special school, and many of the brightest ones may be at 
private schools. Take an extreme case: if we selected a group of 
children all of precisely the same intelligence, any correlation be- 
tween intelligence and work would disappear. One important reason 
why intelligence tests appear to be of much less value for predicting 
scholastic aptitude among secondary grammar than among primary 
pupils, and to be poorer still among university students, is that 
secondary pupils are more homogeneous or more highly selected 
than primary. Most of the children of I.Q. less than 110 are weeded
out before the grammar school is reached, and university entrance
requirements remove many more from the middle and lower end of the
scale. Thus correlations necessarily sink as we pass from unselected
children to the primary school, from the primary to the grammar,
and from the grammar to the university level. The typical correlation
of + 0.85 which we have quoted for selection tests at 11+ with
later school work drops to about + 0.45 within a grammar school
group.

It is not easy to decide just what degree of heterogeneity should be 
regarded as normal and reasonable, nor to specify the extent to 
which any given group of persons exceed or fall short of this degree 
of heterogeneity. Hence the correction of correlations either for
undue homogeneity or excessive heterogeneity is complicated. The
reader may be referred to discussions by Thomson (1939) and
Thorndike (1949).

The main conclusion to be drawn from the above sections is that
the investigator should always scrutinize his correlational data very 
carefully, first in order to see whether any irrelevant common factors 
may be increasing spuriously the size of his coefficients, secondly to 
judge whether or not the diversity of scores in the group tested is 
unusually wide or restricted. He should realize that the absolute size 
of a correlation means very little, since this size may be so much 
altered by uncontrolled common factors or heterogeneity. His inter- 
pretation of the amount of relationship between the correlated 
variables should be made relative to the particular data with which 
he is working.

In contrast to the absolute size, the predictive value or forecasting 
efficiency of a correlation is not affected by heterogeneity. For the 
latter, as we have seen, depends on $\sigma\sqrt{1 - r^2}$. Hence an increase in
r on account of excessive heterogeneity is offset by the increase in $\sigma$,
and the P.E. remains the same. It should be noted further that
forecasting efficiency is not in the least upset by irrelevant common
factors. True, it might be foolish to predict speed of handwriting from 
the size of boots, yet the correlation obtained between these variables 
would provide a valid means of making such predictions. 


COMBINING TWO OR MORE TESTS 
The educational statistician is frequently faced with the problem 
of combining the results of several examinations and tests, and using 
the total scores for the selection of the best pupils. We have already 
dealt rather fully with the point that the distributions to be combined 
should be expressed in equivalent units, i.e. that they should possess
the same means and S.D.s, except when it is desired to give more 
weight to one set of marks than to another, in which case the former 
should have the larger S.D., not necessarily the larger mean. Here 
we must draw attention to an important fact which markers seldom 
realize, namely that the S.D. or dispersion of combined variables is 
always less than that of the separate variables by an amount depend- 
ing on the lowness of correlation between these variables. This fact 
is a natural corollary of the facts of regression outlined above. 

Suppose, for example, that total marks are based on the average 
of six examinations, each with a mean mark of 60, a range of 30 to 
90, and a S.D. of 10. If the average correlation between the six
variables is + 0.70, the final distribution will probably have a S.D.
of 8.66 and the average totals will range from 34 to 86. But if the
average inter-correlation is only + 0.30, the final distribution will
have a S.D. of 6.45, and the range will drop from 30-90 to 41-79. It
is easy to see why this must be so. When the average inter-correlation 
is moderate or low, no pupil or student will get very high marks, or 
very low ones, on all the examinations. Hence the highest and lowest 
average marks will be distinctly nearer to the mean than the highest 
and lowest marks on each separate examination. The same phenom- 
enon occurs when marks on different questions in a single exam- 
ination are totalled. For instance, if five questions are marked, each 
with a range of 5-20 out of 20, the examination totals will probably 
range from about 40-85, and not from 25-100. 

A chief examiner, head teacher, or other marking authority should 
make allowance for these facts. If he wishes the distribution of final 
totals to show a certain proportion of very high (over 80%) marks, 
and a certain proportion of low (under 40%) marks, then he must 
either direct the markers of the component examinations to employ 
extra-wide distributions (i.e. to award much larger proportions of 
marks over 80% and under 40%), or else he must re-scale the final 
marks so as to spread them out again into a distribution with the 
required dispersion. The formula connecting $\sigma$, the S.D. of the
component distributions, and $\sigma_{av}$, the S.D. of these distributions averaged,
is:

$$\sigma_{av} = \sigma\sqrt{\frac{1 + (n - 1)R}{n}}$$

where R is the average correlation between all the sets of marks, and n
the number of sets.¹
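
A short Python sketch may help in applying this relation. It assumes, as the worked figures in the text imply, that the S.D. of the averaged marks equals the component S.D. multiplied by sqrt((1 + (n - 1)R) / n). The function names are illustrative only, and the second function is simply an algebraic rearrangement for recovering the average inter-correlation from the two S.D.s.

```python
import math


def sd_of_average(sd: float, n: int, mean_r: float) -> float:
    """S.D. of the average of n sets of marks, each with S.D. `sd`,
    whose average inter-correlation is `mean_r`."""
    return sd * math.sqrt((1.0 + (n - 1) * mean_r) / n)


def mean_intercorrelation(sd: float, sd_avg: float, n: int) -> float:
    """Average inter-correlation implied by component and averaged S.D.s."""
    return (n * (sd_avg / sd) ** 2 - 1.0) / (n - 1)


# Six examinations, each with S.D. 10, as in the example in the text.
print(round(sd_of_average(10, 6, 0.70), 2))
print(round(sd_of_average(10, 6, 0.30), 2))
```

A marking authority could use the first function to decide how far component distributions must be widened, or final marks re-scaled, to obtain a required final dispersion.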

Multiple Correlation. — If two tests or examinations each correlate
moderately with a third, then the two combined will usually correlate
higher with the third than either separately. For example, English
and arithmetic examinations at 11 usually yield correlations of
about + 0.40 with subsequent achievement in grammar schools. The
combined mark is likely to correlate about + 0.45 with such
achievement. Every educationist has realized and applied this principle.
With the aid of statistical methods we can formulate it more precisely. 

¹ Alternatively, $\sigma_{sum}/n$ may be substituted for $\sigma_{av}$, where $\sigma_{sum}$ is the S.D. of the
total scores. The same formula, turned around, is often useful for finding the
average inter-correlation of n variables when their separate and combined S.D.s
are known. Cf. Kelley (1924), Formula No. 171.

Let there be n tests which give correlations $r_{1A}$, $r_{2A}$, $r_{3A}$ . . . etc.,
with some measure of achievement, A, the sum of all these
coefficients being $\Sigma r_{1A}$; and let the correlations of the tests with one
another be $r_{12}$, $r_{13}$, $r_{23}$ . . . etc., their sum being $\Sigma r_{12}$. Then the
correlation of A with the combined tests is:

$$r_{A(1+2+3+\dots+n)} = \frac{\Sigma r_{1A}}{\sqrt{n + 2\,\Sigma r_{12}}}$$

Ex. 54. — Three different types of intelligence tests gave
correlations of + 0.483, + 0.456, and + 0.337 with students’ marks in
psychology. $\Sigma r_{1A} = 1.276$. Their inter-correlations were + 0.507,
+ 0.468, and + 0.508. $\Sigma r_{12} = 1.483$. What will be the correlation
of the combined tests with psychology marks?

$$r_{A(1+2+3)} = \frac{1.276}{\sqrt{3 + 2 \times 1.483}} = +\,0.522$$
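
The same calculation can be sketched in Python. The function name `composite_correlation` is an assumption made for illustration. It takes the correlations of each test with the criterion A, together with the inter-correlations of the tests, and assumes that all tests are combined with equal weight and equal S.D.s, as in the example.

```python
import math


def composite_correlation(validities, intercorrelations):
    """Correlation of a criterion with the unweighted sum of n tests.

    validities        -- correlations of each test with the criterion
    intercorrelations -- correlations between each distinct pair of tests
    """
    n = len(validities)
    if len(intercorrelations) != n * (n - 1) // 2:
        raise ValueError("expected one inter-correlation per pair of tests")
    return sum(validities) / math.sqrt(n + 2.0 * sum(intercorrelations))


r = composite_correlation([0.483, 0.456, 0.337], [0.507, 0.468, 0.508])
print(round(r, 3))
```

The combined figure exceeds the best single correlation (+ 0.483) only modestly, because the three tests overlap considerably with one another.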


The above method is a simple but crude way of combining tests 
so as to give better predictions of achievement. Naturally some of 
the tests are superior to others, and so should receive more weight 
in the total score. Further, the most useful tests will be the ones which, 
as well as correlating highly with A, do not correlate highly with the 
other tests that are being employed. For obviously, if tests 1 and 2 
are measuring almost precisely the same thing, they will predict, as 
it were, the same aspect of A. Whereas what we want are tests which 
predict different aspects, and so between them cover the whole of A 
more thoroughly. Thus it may be better to leave out test 2, and 
substitute another which, while perhaps correlating less well with A, 
has the advantage of being almost independent of test 1. The method 
known as multiple correlation enables the educationist or psycholo- 
gist to choose from a number of tests those which will in combination 
give the highest possible correlation with A, and to weight them 
suitably. Multiple regression equations can also be calculated which 
predict an individual’s most probable score on A by combining his 
marks on the separate tests in appropriate proportions. The method 
is too elaborate to be given here, but the above discussion will, it is 
hoped, indicate its object and main principles. 

The vocational psychologist applies an analogous procedure in 
selecting the best candidates for a certain occupation, that is, he 
gives a series of tests each of which has been found to correlate 
moderately with success at the occupation, and combines them by 
the multiple correlation method so as to obtain as accurate predic- 
tions as possible. 

TEST RELIABILITY

If a test or examination is applied a second time under similar 
conditions, and the testees’ scores differ widely from those previously 
obtained, the test is obviously a poor one. It is said to be reliable 
only if the two sets of scores correlate highly with one another. 
Further, if different testers apply and score a test or examination, 
they should arrive at the same, or nearly the same, scores. Repetition 
of a test may, however, give an unfair picture of its reliability, since 
the testees may remember their previous responses. Many tests are 
therefore supplied in two or more parallel forms, so that if a re-test
is desired, different questions may be set which should, nevertheless, 
yield much the same results as did the questions of the first test. 
When no alternative form is available, a single test is often split into 
two equivalent halves. For instance, the scores on odd- and on even- 
numbered questions or items may be totalled separately, and then
inter-correlated. But it is known that test reliability depends on the 
length of the test, hence the correlation between one half and the 
other half is unduly low, and is usually corrected by the formula: 

$$R = \frac{2r}{1 + r}$$

Here r is the obtained coefficient, and R the coefficient
to be expected had it been possible to compare the whole of the test
with another similar test.
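
The split-half procedure can be illustrated by a minimal Python sketch. The item marks below are hypothetical, invented purely for illustration, and the function names are assumptions. Each person's odd-numbered and even-numbered items are totalled separately, the two half-scores are correlated, and the result is stepped up by R = 2r / (1 + r).

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def step_up(r):
    """Corrected reliability of the whole test from the half-test correlation."""
    return 2.0 * r / (1.0 + r)


def split_half_reliability(item_marks):
    """Return (half-test correlation, corrected whole-test reliability)."""
    odd = [sum(person[0::2]) for person in item_marks]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in item_marks]  # items 2, 4, 6, ...
    r = pearson_r(odd, even)
    return r, step_up(r)


# Hypothetical marks (1 = right, 0 = wrong) of six persons on six items.
item_marks = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
r_half, r_whole = split_half_reliability(item_marks)
print(round(r_half, 2), round(r_whole, 2))
```

So small a sample would of course give a very unstable coefficient in practice; the data serve only to show the order of the steps.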

In all these instances — namely repetition, application and scoring 
by a different tester, parallel form, and corrected split-half — we 
generally expect to get an inter-correlation of at least + 0.90. For if
this reliability coefficient is much lower it would indicate that the
scores are too unstable to be trusted. Standardized educational and
intelligence tests generally reach this level of reliability, but ordinary
examinations often fail to do so, as we shall see in Chap. XI. Many
performance tests, tests of occupational abilities, and character or
temperament tests, are also often poor in reliability. However, by
increasing the length of a test, or by combining several parallel tests, 
a more satisfactory coefficient may be attained. 

Knowing the reliability coefficient, it is possible to calculate the
reliability, or the limits of variation, of individual scores, by the
formula already given: $P.E._{est.\,X_1} = 0.6745\,\sigma_1\sqrt{1 - r_{12}^2}$. For
example, if parallel intelligence tests give a correlation of + 0.93, and
the S.D. of their I.Q.s is 15, then
$P.E._{I.Q.} = 0.6745 \times 15\sqrt{1 - 0.93^2} = 3.71$.
But if the reliability coefficient is only + 0.70, $P.E._{I.Q.} = 7.21$,
that is almost twice as large. Such a P.E. is, as we have already
seen, an index of the extent to which scores on the second test may
vary when predicted from scores on the first test. About half the
testees will obtain I.Q.s on the second test within 4 points (if
r = 0.93), or about 7 points (if r = 0.70), of their I.Q.s on the first
test. But rare cases (1 in 100) may alter by as much as 4 × P.E., i.e.
by some 15 and 29 or more points respectively.

The application of two parallel forms of a test to the same indi- 
vidual yields, as we have seen, somewhat different results. If we 
possessed a large number of additional parallel forms and applied 
them, still other results would be obtained (quite apart from effects 
of practice). Such results for one individual would tend to conform to 
a normal distribution, and their grand average would, of course, be 
the best possible measure of the individual’s standing on the test — 
what may be called his true score. The situation is exactly analogous 
to that which we discussed in connection with group differences 
(Chap. V), except that there we were concerned with variations 
among the means of sample groups, here with variations among an 
individual’s scores on sample tests. An individual’s score therefore 
possesses another P.E., which indicates its liability to deviate from 
his true score. The formula for this also involves the reliability of 
the test, and the S.D. of the test scores of an unselected group of 
persons. $P.E. = 0.6745\,\sigma\sqrt{1 - r}$. Thus the P.E.s of I.Q.s for tests
with reliability coefficients of + 0.93 and + 0.70 are 2.67 and 5.53
respectively. Note that these P.E.s are smaller than those quoted
above, since $\sqrt{1 - r}$ is necessarily smaller than $\sqrt{1 - r^2}$. This is to
be expected, since the former P.E. represents the deviation of one
sample score from another sample, whereas the latter represents the
deviation from the mean of all the possible samples.
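
The two probable errors can be compared side by side in a brief Python sketch. The function names are illustrative assumptions. The first function uses 0.6745 x S.D. x sqrt(1 - r^2), the error of one form predicted from another; the second uses 0.6745 x S.D. x sqrt(1 - r), the error of an obtained score about the true score. The results agree with the figures in the text to within about 0.02, the small differences presumably arising from rounding in the original hand calculation.

```python
import math

PE_FACTOR = 0.6745  # a probable error is 0.6745 times the standard error


def pe_of_estimate(sd: float, r: float) -> float:
    """P.E. of a score on one form when predicted from a parallel form."""
    return PE_FACTOR * sd * math.sqrt(1.0 - r * r)


def pe_of_obtained_score(sd: float, r: float) -> float:
    """P.E. of an obtained score about the individual's true score."""
    return PE_FACTOR * sd * math.sqrt(1.0 - r)


# I.Q.s with S.D. 15 and reliabilities of 0.93 and 0.70, as in the text.
for r in (0.93, 0.70):
    print(r, round(pe_of_estimate(15, r), 2), round(pe_of_obtained_score(15, r), 2))
```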

Factors Influencing Test Reliability . — Since the reliability of a test 
is almost synonymous with its thoroughness, it can be increased or 
lowered almost indefinitely by lengthening or shortening the test. 
The formula for predicting the reliability of a test whose length is 
doubled has already been cited. When it is lengthened n times, the
formula becomes:

$$R = \frac{nr}{1 + (n - 1)r}$$

This is known as the Spearman-Brown prophecy formula.

Ex. 55. — A short educational test has a reliability coefficient of
+ 0.60. How much must it be lengthened to attain a coefficient of
+ 0.90? Turning the above formula round, we get:

$$n = \frac{R(1 - r)}{r(1 - R)} = \frac{0.90 \times 0.40}{0.60 \times 0.10} = 6$$

It should be six times as long.

The same calculations would apply to the following example. 
If two examiners mark a set of examination scripts, and their marks 
inter-correlate + 0.60, how many examiners should mark the scripts
for their combined marking to attain a reasonable reliability of
+ 0.90? The answer is six.
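
Both directions of the Spearman-Brown calculation can be sketched in Python. The function names `spearman_brown` and `lengthening_needed` are illustrative assumptions. The first predicts the reliability of a test lengthened n times; the second turns the formula round to find the lengthening required for a target reliability.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test of reliability r is lengthened n times."""
    return n * r / (1.0 + (n - 1.0) * r)


def lengthening_needed(r: float, target: float) -> float:
    """Factor by which a test of reliability r must be lengthened
    to reach the target reliability."""
    return target * (1.0 - r) / (r * (1.0 - target))


n = lengthening_needed(0.60, 0.90)
print(round(n, 2), round(spearman_brown(0.60, n), 2))
```

With n = 2 the first function reduces to the split-half correction 2r / (1 + r).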

The formula assumes that the correlations between the additional 
tests (or examiners) would be the same as between the first pair of 
tests (or examiners). This condition seems generally to be satisfied 
by educational measurements. 

A second important factor is the heterogeneity of the group whose 
scores provide the reliability coefficient. Parallel forms of an
intelligence test will inter-correlate much more highly if applied to
school-children of several years’ age range than when applied to a
single class or to a single year-group (i.e. children whose ages range
over only one year). Coefficients obtained from the latter groups are
conventionally accepted as fairer indices of reliability than
coefficients from the former. If we know $\sigma$, the S.D. of test scores in a
group which is unduly homogeneous or heterogeneous, then it is
possible to predict R, the reliability to be expected in a group with
S.D. $\Sigma$, by Kelley’s formula:

$$R = 1 - \frac{\sigma^2}{\Sigma^2}(1 - r)$$

Ex. 56. — The reliability coefficient of an intelligence test when
applied to a class of primary school-children was + 0.88. The S.D.
of their I.Q.s was 14. What would be its reliability in a secondary
school class whose I.Q.s have a S.D. of only 9?

$$R = 1 - \frac{14^2}{9^2} \times 0.12 = +\,0.71$$

Note that the P.E. of an individual’s score is not upset by this
factor, as is the reliability coefficient. In the above example it works
out at 3.27 in either class.
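
This example, and the constancy of the P.E., can be verified with a short Python sketch. It assumes Kelley's formula in the form R = 1 - (s^2 / S^2)(1 - r), where s is the S.D. of the group in which r was obtained and S that of the group for which R is wanted; this form reproduces the figures of the example. The function names are illustrative only.

```python
import math


def reliability_in_other_group(r: float, sd_known: float, sd_other: float) -> float:
    """Reliability expected in a group of S.D. `sd_other`, given the
    reliability r found in a group of S.D. `sd_known`."""
    return 1.0 - (sd_known ** 2 / sd_other ** 2) * (1.0 - r)


def pe_of_obtained_score(sd: float, r: float) -> float:
    """P.E. of an individual's obtained score about his true score."""
    return 0.6745 * sd * math.sqrt(1.0 - r)


r_primary = 0.88
r_secondary = reliability_in_other_group(r_primary, sd_known=14, sd_other=9)
print(round(r_secondary, 2))
print(round(pe_of_obtained_score(14, r_primary), 2),
      round(pe_of_obtained_score(9, r_secondary), 2))
```

The two probable errors coincide because the quantity S.D.^2 x (1 - reliability) is left unchanged by the formula.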

The assessment of test reliability and its implications are complex 
topics, and the account given here covers only the more elementary 
points. Advanced students are advised to consult Thorndike (1949) 
and Gulliksen (1950). 

Function Fluctuation. — Most reliability coefficients fall below 1.0
for two distinct reasons: first, through the test being insufficiently
thorough (i.e. liable to errors of sampling), and, secondly, because the
persons tested or examined may alter in their level of achievement
between one test and another. The latter factor has been termed
‘individual variance’, or ‘function fluctuation’ (Thouless, 1936).
Naturally it has a greater effect on coefficients obtained from parallel
tests given on different occasions, or from repeated tests, than it
does on coefficients obtained by the split-half method. Hence the
difference between these two types of coefficient affords a means of
distinguishing between the reliability of the test itself, and the
tendency of testees to fluctuate in respect of the ability tested. Thouless
provides a formula for estimating function fluctuation from this
difference.

Effects of Test Reliability on Test Inter-correlations. — A test which
is not perfectly reliable may be regarded as measuring so much of the
function which a perfect test would measure, and so much error.
The error part of it is quite useless, and will have the effect of
reducing the correlation between this test and any other test. If
both tests possess considerable errors through lack of reliability,
their inter-correlation will be considerably reduced. In fact, the
maximum possible value of such an inter-correlation will be, not 1.0,
but the square root of the product of their reliability coefficients.
Suppose we find a correlation of + 0.50 between a selection
examination and subsequent achievement, and that the reliabilities of
the examination and the measure of achievement are only + 0.70
and + 0.80. Then the true correlation, if it were possible to obtain
perfectly reliable exams and measures of achievement, would be

$$\frac{0.50}{\sqrt{0.70 \times 0.80}} = +\,0.67$$
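
The correction just illustrated can be sketched in Python. The function names are assumptions chosen for illustration. The first gives the ceiling which imperfect reliability sets on any obtained correlation; the second divides the obtained correlation by that ceiling to estimate the correlation between perfectly reliable measures.

```python
import math


def maximum_correlation(rel_x: float, rel_y: float) -> float:
    """Highest correlation obtainable between two measures whose
    reliability coefficients are rel_x and rel_y."""
    return math.sqrt(rel_x * rel_y)


def corrected_correlation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Obtained correlation divided by the ceiling set by unreliability."""
    return r_xy / maximum_correlation(rel_x, rel_y)


print(round(maximum_correlation(0.70, 0.80), 2))
print(round(corrected_correlation(0.50, 0.70, 0.80), 2))
```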

Spearman has called this reduction effect of imperfect reliability
‘attenuation,’ and has provided formulae by which it may be corrected
or compensated. Such correction is, of course, of theoretical
interest, since it tells us what correlations to expect between various
psychological abilities if they could be accurately measured. But it is
of little practical value, since completely accurate tests are
unattainable, and for all purposes of prediction we have to employ the
obtained, uncorrected, correlation coefficients (cf. Thouless, 1939i/).

ANALYSIS OF ABILITIES

Fallacious Views of Mental Organization. — Correlational investiga- 
tions have assumed especial importance in recent years, since they 
enable us to clarify our conceptions of human abilities, and of the 
organization or structure of the mind. The layman’s notions as to 
what abilities exist, and what each ability includes or excludes, are 
extremely loose, and the same was true of psychological theory 
throughout most of last century. Teachers and parents are heard to 
remark: “Johnny is poor at book-learning, but he makes up for it 
by his cleverness with his hands. He will be a good mechanic when 
he grows up.” “Mary has an excellent memory and good power of 
attention.” “Willie is slow but sure,” and so on. Such statements are 
now known, as a result of experimental investigation, to be largely 
fallacious. Only exact study can tell us how far the slow person is 
likely to be sure, whether manual ability in a boy has any predictive 
value for mechanical ability in a man, whether there is any such 
entity in the mind as memory. Again, the so-called faculty school of
psychologists of a hundred years ago considered the mind to be 
made up of a number of distinct powers or faculties such as reason- 
ing, judgment, imagination, etc. Indeed, it was at one time regarded 
as the main object of education to train each of these faculties by 
providing children with appropriate ‘mental gymnastics.’ As soon, 
however, as scientific psychologists tried to define the faculties pre- 
cisely and measure them, and to determine the effects upon them of 
various types of training, they fell into discredit. A priori analysis has 
not even been able to decide what is the true nature of intelligence. 
The accounts of it given by different writers have been extremely 
varied and discrepant. Nowadays, therefore, such problems, both of 
theory and practice, are approached by means of experimental
studies of correlations between tests. For they all eventually resolve 
into the question — what correlates with what? 

The Scientific Definition of an Ability. — First we must define what 
we mean by an ability, capacity, or faculty. It implies the existence 
of a group or category of performances which correlate highly with 
one another, and which are relatively distinct from (i.e. give low
correlations with) other performances. Take, for example, mechan- 
ical ability. Some people are better than others at tasks involving 
manipulation of mechanisms, and this ability is fairly consistent or 
general in the sense that those who are good at one such task are also
usually good at others. The consistency is not perfect. A may be 
especially good with meccano, less clever with locks, B the opposite, 
C the best with electrical fittings, and so on. But if there was no 
appreciable correlation between these and other similar perform- 
ances, if they were all found to be specific, we should not be entitled 
to regard mechanical ability as a real entity. Another important 
condition is that the performances should be fairly reliable. We do 
not expect perfect stability, but if people fluctuated wildly in their 
success at mechanical tasks from day to day, we should hardly 
recognize the existence of mechanical ability. There is a further possi- 
bility, namely, that the various performances may inter-correlate 
well, but that the correlations may be accounted for by some other 
common factor such as general intelligence (the age factor, we will 
assume, has already been eliminated). Mechanical ability would not 
then be anything distinctive. However, intelligence tests can be 
applied to the same persons who take the mechanical tests, and 
intelligence can be partialled out or held constant. It is then found
that the mechanical tests still overlap, or show positive correlations
with one another over and above their correlations due to the
intelligence factor. We are, therefore, able to accept mechanical ability as
something consistent and distinctive. Tests of it do correlate
sufficiently well and reliably for us to postulate some common element
running through them.

A common element such as this is often referred to as a group 
factor, since it occurs in a group of performances of a certain 
restricted type. It differs from a general factor like intelligence,
which is found to run through an extremely wide range of tests, which 
indeed enters to some extent into all abilities. 

Since the consistency or overlapping of different mechanical tests 
is not perfect, no one test can give a really adequate measurement of 
mechanical ability. But by combining several tests we are likely to 
get a result much more representative of the group factor as a whole. 
The same applies, of course, in the measurement of educational 
abilities. We do not expect long-division sums alone to tell us how 
good pupils are at arithmetic, although they provide some indication 
of the ability. Instead we set several types of sums, and try to cover
the whole field of arithmetic which the pupils are supposed to know.

Arithmetic is likely to yield a group factor similar to the mechani- 
cal ability factor. But many of the faculties assumed by psychologists 
in the past, or by laymen at the present time, fail to stand up to the 
criteria enumerated above. Memory, for example, refers to so many 
different things, that it is meaningless to talk of so-and-so as having a 
good or bad memory in general, or of ‘training children’s memories.’
The speeds with which people learn various sorts of material, and
their retentiveness (i.e. the amounts of such material which they can
recall after an interval), give only moderate or low inter-correlations,
especially when the influence of intelligence is removed. Probably
there exist several small group factors, each representing memory
for a certain narrow range of material.¹ In other words, there may be
several different types, but no single entity, of memory.

The notion that mental speed differs from, and is usually opposed
to, mental accuracy or power also appears to be fallacious. If there 
were two quite distinct abilities, then tests of speed should give high 
correlations with one another, and low or negative correlations with 
tests of accuracy or power. Under appropriate conditions of testing, 
difficult untimed tests and easy speeded tests do yield somewhat 
different results; thus a partial distinction is indicated. But they still 
correlate positively to a moderate or a high degree, showing that, on 
the whole, those who are ‘slow’ at mental work are more likely to be
‘unsure’ than ‘sure.’ 

All Mental Abilities are Positively Inter-correlated. — Let us now turn
to the wider problem — what abilities does the mind contain, and how 
are they organized or inter-related? A first, extremely important, fact 
is that all tests of mental abilities tend to give positive inter-correla- 
tions. Even tests of manual, physical, and other non-intellectual 
functions also usually correlate positively with one another and with 
mental tests, though the coefficients are often very small, whereas
the coefficients among tests of intellectual functions are generally 
moderate to high. In selected groups, such as university students, 
occasional negative correlations may arise; but these are not typical, 
and they are seldom statistically significant. In other words, talent and
versatility, more often than not, go together; a person outstandingly
good or bad in one field is usually above or below average,
respectively, in all other fields. This fact at once serves to cast a great
deal of doubt on the ‘compensation theory,’ i.e. the view that those
who are poor at one thing make up for it by being good at something
else. It does not entirely disprove it, for the following reasons.

First, the reader should study the scattergram for a very low
positive correlation, such as that shown in Fig. 26. This compares
the marks of students in an examination in hygiene with their
intelligence test scores, and yields an r of +0.146. It will be seen that,
though there is still a slight tendency for high scores on X1 to be
associated with high scores on X2, yet the correlation admits of many
big exceptions. Students who are among the highest for intelligence
may range right down to the lowest for hygiene. The
large amount of regression means that a fair proportion of persons
can be high on one variable, low on the other. Such a state of affairs
is certainly much more likely to occur here than it is when the
correlation is high (as in Fig. 23).

Fig. 26. Scattergram and regression lines illustrating a correlation,
r = +0.146. (Axes: hygiene examination marks, X1; intelligence test, X2.)

¹ A summary of the evidence on this and other types of ability, obtained
between 1940 and 1950, is given in the writer’s book, The Structure of Human
Abilities (1950).
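How many persons can be "high on one variable, low on the other" at a given correlation can be estimated with a short calculation. The sketch below is illustrative only: it assumes that both sets of marks are normally distributed and splits each at its average, neither of which is stated in the text. Under those assumptions the share of discordant cases is 1/2 − arcsin(r)/π. The value 0.90 used for comparison is a hypothetical "high" correlation, not the coefficient of Fig. 23.

```python
import math


def discordant_fraction(r: float) -> float:
    """Share of persons above average on one variable and below average on
    the other, ASSUMING bivariate normal scores with correlation r."""
    return 0.5 - math.asin(r) / math.pi


low = discordant_fraction(0.146)   # coefficient quoted for Fig. 26
high = discordant_fraction(0.90)   # hypothetical high correlation, for contrast

print(f"r = 0.146: {low:.3f} of persons are high on one measure, low on the other")
print(f"r = 0.900: {high:.3f}")
```

With no correlation at all the share would be exactly one half, so a coefficient as low as +0.146 removes only a small part of the exceptions.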

Now ‘book-learning’ and ‘being clever with the hands’ are not 
highly correlated, hence it is quite possible for some children to be 
much better at one than at the other. But it is definitely false to 
suppose that they are inversely related, that those who are bad at
one are therefore good at the other. On the contrary, those who are 
superior in intellectual abilities are on the average superior in prac- 
tical ones also, though to a lesser extent. 

A second reason is that age differences often obscure the issue. 
Some teachers are incredulous when told that children of superior
intelligence tend to be superior also in growth and physical capacities
to dull children, since they know well that the dullards in an ordinary
school class are usually larger and stronger than the bright youngsters.
But, then, they are forgetting that the dullards are far older
than the bright ones, and so may be physically superior. If the dull
children were compared with bright ones of the same age, the
physical superiority of the latter would at once be obvious. Nevertheless,
in this instance also the correlations are so low that a great
many exceptions are possible.

Thirdly, ability is obviously to some extent a matter of interest 
and practice. Pupils who are backward at intellectual studies and
who are on the average backward, but much less so, in physical and
manual pursuits, naturally devote more time and energy to such 
pursuits. Bright pupils, though initially stronger and better at prac- 
tical matters, may be more interested in intellectual matters, and so 
fall behind in other things. It is possible, therefore, for the differences 
that exist between abilities which give low inter-correlations to be- 
come accentuated, though there is still no evidence known to the 
writer that the correlations ever become negative. 

We may conclude, then, that all the abilities with which teachers 
are concerned are, to a greater or lesser extent, positively correlated. 
As an example, we will quote some of Burt’s (1917) results. He gave 
thirteen objective tests of achievement in various school subjects to 
120 children aged 11+, and calculated the inter-correlations. The
following table shows the average correlation between each test and
the other twelve tests.¹

    Test                        Average correlation
                                with the other twelve
    Composition                     +0.505
    History                         +0.448
    Geography                       +0.456
    Nature Study                    +0.458
    Mechanical Arithmetic           +0.283
    Arithmetic Problems             +0.457
    Reading Fluency                 +0.323
    Reading Comprehension           +0.369
    Writing Speed                   +0.323
    Writing Quality                 +0.274
    Dictation                       +0.325
    Drawing                         +0.275
    Handwork                        +0.320

¹ The average reliability of the tests was +0.738. All these coefficients would,
therefore, be larger by roughly 35% if the tests had been perfectly reliable (cf.
p. 129). A later repetition of the investigation with a much larger number of
children has yielded closely similar results (Burt, 1939a).

Composition, history, geography, nature study, and arithmetic
problems overlap with the other tests to the largest extent. Mechanical
arithmetic, drawing, and writing quality correlate to the smallest
extent. This shows that there is a tendency for a pupil’s achievements
in all subjects to run on a fairly even level, though the correlations
are low enough for some pupils to exhibit considerable divergences —
special strengths in some subjects, special weaknesses in others.
During the secondary school stage, correlations between subjects
decrease, partly perhaps because abilities tend to differentiate during
adolescence and specialised talents develop, but mainly because the
pupils form more homogeneous groups (cf. pp. 122-3). Nevertheless,
the positive overlap of all intellectual abilities continues even beyond
the university level. For example, the present writer inter-correlated
the marks of 300 graduate students at a training college, and obtained
small positive coefficients between subjects as widely different as
teaching skill, arithmetic, hygiene, education, psychology, speech
training, singing, physical training, etc. (cf. Vernon, 1939).
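The footnote to Burt’s table states that the coefficients would be roughly 35 per cent larger with perfectly reliable tests. That figure can be checked with the usual correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes, as the footnote implies, that the average reliability of +0.738 can stand for both tests in each pair; the function name is illustrative.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correction for attenuation: the correlation expected between two
    measures if both were perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)


average_reliability = 0.738   # quoted in the footnote
observed = 0.505              # composition, from the table

corrected = disattenuate(observed, average_reliability, average_reliability)
increase = corrected / observed - 1.0

print(f"corrected coefficient: {corrected:.3f}")
print(f"proportional increase: {increase:.1%}")
```

Because the same reliability is used for both tests, the proportional increase is the same for every row of the table.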


These facts are of considerable importance to examiners. They 
indicate that a General Certificate in several subjects, achieved at the 
age of 15 to 18, is some criterion of aptitude for any intellectual 
occupation, or for a university career, albeit not a very good one. 
Similarly, the traditional method by which the Civil Service selected 
people for a wide variety of posts was a general examination based 
largely on subjects studied at school or university. This implied that
the possession of ‘brains’ for one subject indicates aptitude for all
other subjects. Though this method was a fairly sound one, the
Civil Service has now come to realize that the efficiency of its
selection can be improved by adding tests of the aptitudes needed 
for each type of post.

The General Factor Responsible for Positive Overlap. — Now when
we find positive association, as in Burt’s investigation, we are justified
in assuming the existence of a general factor (general educational
ability) which runs through all the separate subjects. Some subjects
may be said to be more ‘saturated’ with it than others, since they
are more highly correlated with it. In the primary school we should
get an excellent indication of this general factor from pupils’ marks
in composition and in arithmetic problems, a much poorer indication
of it from drawing or handwriting quality. Statistical methods are
available for assessing a pupil’s standing on this general factor from
his marks on all the component subjects (cf. p. 124). Further, it is
quite easy to partial out the factor, and so to see whether or not it
accounts for all the correlations between the separate subjects,
whether, in other words, it is the only common element involved.
When this is done, it is found that there are still a number of
correlations over and above those due to the general factor. In Burt’s
study, the residual correlations indicated considerable overlapping
within the following four groups of tests: (1) arithmetic problems
and mechanical arithmetic; (2) handwork, drawing, writing speed
and quality; (3) dictation, reading comprehension and speed;
(4) composition, history, geography, and nature study. But there
was little or no overlap, or even negative correlations, between the
tests in one of these groups and those in another group. Such results
show clearly the existence of group factors, i.e. of distinctive types of
educational abilities, over and above general ability. In this instance
there is an arithmetical factor, a manual one, a linguistic one, and
one including all the higher, more integrative, school subjects. As
Burt points out, we have here a basis for cross-classification of
pupils. He suggests that they should be assigned to school classes in
accordance with their achievements in the linguistic and ‘integrative’
subjects, and that they might advantageously be reclassified for
arithmetic lessons, and again for manual subjects.
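Partialling out a general factor, as described above, can be illustrated with the ordinary first-order partial correlation formula. The sketch below is a minimal example and not Burt’s own procedure: all the coefficients and saturations in it are hypothetical, chosen to show one pair of tests that still overlaps after the general factor is removed (suggesting a group factor) and one pair that does not.

```python
import math


def partial_corr(r_xy: float, r_xg: float, r_yg: float) -> float:
    """Correlation between tests x and y with factor g held constant."""
    return (r_xy - r_xg * r_yg) / math.sqrt((1 - r_xg**2) * (1 - r_yg**2))


# Hypothetical figures: two tests from the same group of subjects.
same_group = partial_corr(r_xy=0.76, r_xg=0.70, r_yg=0.60)

# Hypothetical figures: two tests from different groups of subjects.
different_groups = partial_corr(r_xy=0.38, r_xg=0.70, r_yg=0.50)

print(f"residual correlation, same group:       {same_group:+.3f}")
print(f"residual correlation, different groups: {different_groups:+.3f}")
```

A sizeable residual in the first case and a negligible one in the second is the pattern that the text describes as evidence of group factors.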

Analogous findings were obtained in the writer’s study of training
college marks. Three separate ‘families’ of subjects could be
distinguished, a practical one (teaching skill, speech training, and
physical training), a scientific one (psychology, hygiene, arithmetic),
and a literary one (English, speech training, education, history). Few
researches along these lines have been carried out in the industrial
sphere, owing to the difficulties of measuring and correlating
different vocational abilities. But we might expect to get from them
similar results, for example, group factors for the engineering type
of occupation, for the salesmanship type, the clerical type, and so on.

FACTORIAL ANALYSIS 

The statistical investigations of Spearman (1927) and others have
shown that it is possible to account for practically the whole of a
set of test inter-correlations by postulating appropriate common
factors.¹ It is legitimate, then, to regard each test as made up of two
components, or as measuring two things. In so far as it correlates
with other tests, it is measuring some factor or factors. This is often
referred to as the test’s communality. But in so far as it fails to
correlate (and most test inter-correlations, be it remembered, are
only moderately high), it is measuring a purely specific component,
or something which is peculiar to that test alone and has no
relationship to any other test. This is called the test’s specificity. It includes,
of course, the error which is present in the test on account of
imperfect reliability² (cf. p. 129). Thus any test of an educational or
vocational ability can be analysed into certain proportions of certain
factors. For example, in Burt’s research, the arithmetic problems
test can be analysed into so much general educational ability, so
much arithmetical group factor (these together make up its
communality), and, thirdly, a component specific to that particular test.
The mechanical arithmetic test can be analysed into a rather smaller
proportion of general factor, a large proportion of the arithmetic
group factor, and a separate specific component. The presence of
these specific components reduces the correlation between the two
tests, but it is still quite high, namely +0.76, because they have both
a general and a group factor in common. A test of another subject,
such as handwork, still correlates positively with arithmetic problems,
because they are both partially saturated with the general
factor, but the coefficient is a small one, namely +0.38, since handwork
embodies a different (manual) group factor, as well as a
different specific component.
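The way in which shared factors build up a correlation can be made concrete with a small sketch. It assumes uncorrelated factors, so that the correlation between two tests is the sum of the products of their loadings on the factors they share. The loadings below are hypothetical: they were chosen only so that the products reproduce the two coefficients quoted in the text (+0.76 and +0.38), and they are not Burt’s published values.

```python
# Hypothetical loadings (not Burt's): each test is described by its loadings
# on the common factors it contains; the remainder of its variance is specific.
arithmetic_problems = {"general": 0.80, "arithmetic": 0.40}
mechanical_arithmetic = {"general": 0.60, "arithmetic": 0.70}
handwork = {"general": 0.475, "manual": 0.60}


def correlation(test_a: dict, test_b: dict) -> float:
    """Sum of loading products over the factors both tests share
    (valid when the factors are uncorrelated)."""
    shared = test_a.keys() & test_b.keys()
    return sum(test_a[f] * test_b[f] for f in shared)


def communality(test: dict) -> float:
    """Proportion of a test's variance due to common factors."""
    return sum(loading**2 for loading in test.values())


def specificity(test: dict) -> float:
    """Remaining proportion: the specific component, including error."""
    return 1.0 - communality(test)


print(f"problems x mechanical: {correlation(arithmetic_problems, mechanical_arithmetic):+.2f}")
print(f"problems x handwork:   {correlation(arithmetic_problems, handwork):+.2f}")
print(f"communality of arithmetic problems: {communality(arithmetic_problems):.2f}")
print(f"specificity of arithmetic problems: {specificity(arithmetic_problems):.2f}")
```

The two arithmetic tests share both a general and a group factor, so two products contribute to their correlation; handwork shares only the general factor with arithmetic problems, so only one does.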

¹ Note that we do not claim to be able to account for the whole of the
correlations, for such correlations are, of course, always liable to errors of sampling.
Often the residual coefficients, which are left after a general factor and some
group factors have been extracted, are so small relative to their errors that it is
not worth while analysing them further. Even with the most accurate techniques,
such as Hotelling’s and Burt’s, the later factors are generally so small that little
significance can be attached to them.

² Thomson (1939) points out, however, that some of the influences producing
unreliability, e.g. those of the function fluctuation type, may actually be
common to several tests, and so contribute to their communality.

By means of factor analysis, then, it is possible to reduce a large
number of test results to a few underlying common factors. Any
psychological test, or measure of a scholastic or occupational 
ability, can be regarded as made up of some of these factors, or as 
possessing a certain ‘factor pattern.’ Further, an individual person’s 
standing on each factor can be derived from his test scores, and he, 
too, possesses a certain factor pattern. Thus, a pupil who is high in 
general educational factor will be above average in most of his school 
work, but his relative goodness at different subjects will depend on 
the strength or weakness of his group and specific factors. 

Dangers of Factor Analysis. — At this point a strong warning should
be given against carrying the factorial conception of abilities too far.
Factors are not entities in the mind whose nature or constitution,
and whose strength or weakness, are immutably fixed. Some writers
do seem to assume that factorial analysis is revealing the fundamental
elements of which human minds are compounded, much as
chemical analysis reveals the elements from which chemical
substances are compounded. We should realize that factors consist
primarily of categories for classifying mental tests and examinations.
Their value lies in the fact that they tell us objectively what correlates
or overlaps with what, and so they take the place of unverified
faculties and other subjective conceptions of human traits and
abilities.

It is obviously illegitimate to compare factors in the educational 
field with chemical elements. For these factors, which are deduced
from the correlations between various school subjects, must depend
largely on how the subjects are taught and on the stage which the
pupils have reached. Different factor patterns will, therefore, be
found in different school classes, according as the teachers connect
up, or bring about overlap in their pupils’ minds between, the sub- 
jects. For example, Oldham (1937-8) found very diverse relation- 
ships between arithmetic, algebra, and geometry among different 
12- to 13-year-old school classes, which could be ascribed to differ- 
ences in teaching methods and in the pupils’ attitudes towards the 
subjects. Others have proved that the factors extracted from a set of 
mental tests alter progressively with practice at the tests. 

Like all other statistical constructs, factors will be liable to a good 
deal of variation if they are derived from measures of small numbers 
of persons. But apart from such sampling errors, they must be 
regarded as depending to a considerable extent on the particular
group of persons measured. Thomson (1939) shows that they are
markedly affected by the heterogeneity or homogeneity of this 
group of persons. In addition, they are dependent upon the particular 
set of tests used, since the factor pattern of a test is in essence an 
expression of the relations between this test and all the other tests in 
the set. Statistical analysis of a single test in isolation is incapable of 
telling us what factors it contains (though subjective analysis, or 
inspection of its content, may often give useful hints as to its prob- 
able factor pattern, which can later be verified by studying ob- 
jectively its correlations with other tests). However, this limitation 
may not be very serious, since, as we shall see below, at least some of 
the most important factors emerge in almost identical form from a 
wide variety of sets of tests. 

Then there is the more subtle difficulty that the same set of tests
can always be factorized in a great many different ways. Most of 
these ways will lead to quite illogical results, so that the investigator 
has to choose from among the various possible factorizations the 
most meaningful and convenient pattern. Take, for example, the 
educational tests applied by Burt. We have so far regarded these as 
analysable into general educational ability, four group factors for 
related subjects, and separate specific factors for each test. Now Burt 
showed, incidentally, that the general factor was not identical with
general intelligence as estimated by teachers, or as measured by
intelligence tests, though the two overlapped closely. But it would
have been quite legitimate to analyse measures of intelligence along
with the educational measures, and to partial out intelligence first, 
so making it the principal factor underlying achievement. If this 
had been done, it is likely that a subsidiary general factor would 
have emerged representing the pupils’ application to, and interest 
in, their work in general. Intelligence plus this subsidiary factor 
would then make up what we have previously denoted as the 
general educational ability factor. The group factors would also need 
to be somewhat modified, since most of the fourth one (composition, 
history, etc.) might have been absorbed into, or accounted for by, 
the intelligence factor.¹
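That one set of correlations admits many factorizations can be shown with a small numerical sketch. It assumes two uncorrelated common factors and four tests with hypothetical loadings; rotating the pair of factor axes through any angle changes every loading, yet the correlations reproduced from the loadings are unaltered. The choice among such solutions therefore has to rest on meaning and convenience, as the text says.

```python
import math


def reproduced_correlations(loadings):
    """Correlations implied by a loading table (off-diagonal cells);
    the diagonal cells hold the communalities."""
    n = len(loadings)
    return [[sum(a * b for a, b in zip(loadings[i], loadings[j]))
             for j in range(n)] for i in range(n)]


def rotate(loadings, theta):
    """Orthogonal rotation of two factor axes through the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]


# Hypothetical loadings of four tests on two factors.
original = [[0.8, 0.1], [0.7, 0.2], [0.3, 0.6], [0.2, 0.7]]
rotated = rotate(original, math.radians(35))

before = reproduced_correlations(original)
after = reproduced_correlations(rotated)
largest_change = max(abs(before[i][j] - after[i][j])
                     for i in range(4) for j in range(4))

print("rotated loadings:", [[round(x, 3) for x in row] for row in rotated])
print(f"largest change in any reproduced correlation: {largest_change:.2e}")
```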

In investigating students’ college marks, the writer obtained
analogous results. The inter-correlations could be accounted for
either by a very prominent general factor and group factors
restricted to a few subjects, or else by three factors, all of about equal
prominence, each entering into several subjects.

¹ In his later study (1939a), Burt describes several methods of factorizing
which do yield numerically different, though logically consistent, results.

Certain other precautions which should be observed in inter- 
preting the results of factorial analyses will be mentioned later. 

Types of Factor Pattern. — Three main types of factor pattern
recur very frequently in educational and psychological investigations. 
They may be represented by the following diagrams, each of which 
refers to twelve hypothetical tests, Nos. 1, 2, ...

In the first, which Holzinger and Harman (1941) call the bi-factor
pattern, there is a general factor running through all the tests,
four different group factors, and specifics. This pattern (or similar 
more complex ones) is usually adequate for representing the results 
of tests of abilities. For instance it applies excellently to Burt’s 
investigation. 

The second, multiple-factor, pattern requires three, four, or more 
factors, none of them general, but all running through several tests, 
to account for the test inter-correlations. It will be seen that Test 1 
involves factors A and C and a specific. But as far as possible each 
test is covered by a single factor (with only negligible loadings on the 
others). This type of pattern is referred to as Simple Structure (cf.
Thomson, 1939; Thurstone, 1947). It is often used in analysing tests 
of abilities, but is still more appropriate where tests of personality, 
character, or temperament are concerned, since in these fields a 
general trait, running through all the tests, would seldom be appro- 
priate. For example, Thurstone (1931) obtained measures of the 
interests of a group of students in a large number of different occu- 
pations — advertising, art, chemistry, etc. — and found that the interest 
scores could be classified under four main factors. These he roughly 
identified as interest in language, in science, in business, and in 
people. Each separate interest could be made up of appropriate 
proportions of these four factors. Other instances have been sur- 
veyed by Eysenck (1953). 

Spearman’s Two-factor Theory. — The third, unitary-factor, pattern
has aroused a great deal of controversy. Historically it was the first
type to be discovered. Early in the present century, Spearman found 
that diverse tests of mental abilities usually gave inter-correlations 
which could be wholly accounted for (within the limits of their 
errors of sampling) by a single general factor plus specific factors. 
He enunciated certain criteria to which the correlations must con- 
form if this is to be possible, the most important of which is called 
the tetrad difference criterion.¹ The general factor obviously
corresponds to what we commonly mean by general intelligence, but
Spearman preferred to symbolize it by the letter g, the specific factors
by the letters s1, s2, ... etc., so as to get over the difficulty, mentioned
above, that the term ‘intelligence’ has been so diversely defined by 
different psychologists. His g factor, it was claimed, would be the 
same whatever the tests used. It could be measured, for example, by 
a battery of tests consisting of Analogies, Completion, Vocabulary, 
and Directions, or equally well by a battery consisting of practical per- 
formance tests (pp. 160-2). Each test would have its own independent 
s, but in so far as it overlapped with other tests, it would be
measuring g. This view is widely known as the Two-factor Theory.
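The tetrad difference criterion mentioned above can be sketched numerically. If every correlation arises from a single general factor, so that the correlation between tests i and j is the product of their g-saturations, then for any four tests the products of correlations taken in pairs are equal and their differences (the tetrad differences) vanish. The saturations below are hypothetical, and the check ignores sampling error, which in practice means the differences need only be small relative to their errors.

```python
from itertools import combinations


def one_factor_correlations(saturations):
    """Correlation table implied by a single general factor."""
    n = len(saturations)
    return [[1.0 if i == j else saturations[i] * saturations[j]
             for j in range(n)] for i in range(n)]


def tetrad_differences(r):
    """All tetrad differences for every set of four tests in table r."""
    out = []
    for i, j, k, l in combinations(range(len(r)), 4):
        out.append(r[i][j] * r[k][l] - r[i][k] * r[j][l])
        out.append(r[i][k] * r[j][l] - r[i][l] * r[j][k])
        out.append(r[i][j] * r[k][l] - r[i][l] * r[j][k])
    return out


# Hypothetical g-saturations of four tests.
pure = one_factor_correlations([0.9, 0.8, 0.7, 0.6])
largest_pure = max(abs(t) for t in tetrad_differences(pure))

# Add a hypothetical extra overlap (a group factor) between the first two tests.
mixed = [row[:] for row in pure]
mixed[0][1] += 0.15
mixed[1][0] += 0.15
largest_mixed = max(abs(t) for t in tetrad_differences(mixed))

print(f"largest tetrad difference, general factor only: {largest_pure:.2e}")
print(f"largest tetrad difference, with a group factor: {largest_mixed:.3f}")
```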

For a while it was thought that the Two-factor Theory would 
apply to all tests of abilities, not only to intellectual ones, and that 
very few additional group factors would be needed. Only occasionally
would a few tests which were very similar in content (e.g. tests
of musical aptitude) be found to inter-correlate over and above their 
correlation due to g. 

Different tests may differ considerably in their g-saturations. 
Those with the highest saturation seem to involve the capacity for 
seeing relationships, whilst tests which involve mainly mechanical or 
rote memory contain very little g. Among school subjects, classics is 
most dependent on g; French, English, history, and the like are also 
highly saturated. Manual subjects and music have much more 
prominent s’s and relatively little g. Thus we find high correlations
between classics and English, low correlations between handwork
and music, and moderate correlations between classics and handwork,
or English and music. The Two-factor Theory not only
explains these and similar facts as to the relations between abilities,
but also provides the mental tester with a scientific basis for choosing
tests which will give the best measures of g. Although every sub-test
in a battery of intelligence tests will involve a certain s-factor, tests
can be selected in which these s’s are small, and when the tests are
combined the s’s will tend to cancel out, so leaving an almost pure
measure of g.

¹ For an explanation, see Knight (1933), or Spearman’s own book (1927).
The apparent discrepancy between the names — Two-factor Theory and
Unitary-factor Pattern — is merely due to the fact that Spearman includes the specific
components as factors, whereas the terms we have used refer only to the
communality of the factorized tests.
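The statement that the specific factors tend to cancel out when tests are combined can be illustrated under simple assumptions: every sub-test has the same g-saturation, the specific factors are uncorrelated with g and with one another, and the battery score is the plain sum of standardized sub-test scores. On these assumptions (which are a simplification, not Spearman’s full argument) the saturation of the total with g is a√n / √(1 + (n − 1)a²) for n sub-tests of saturation a. The value 0.6 below is hypothetical.

```python
import math


def composite_g_saturation(saturation: float, n_tests: int) -> float:
    """Correlation between g and the sum of n equally saturated tests whose
    specific factors are mutually uncorrelated (simplifying assumptions)."""
    return (saturation * math.sqrt(n_tests)
            / math.sqrt(1 + (n_tests - 1) * saturation**2))


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} sub-tests: saturation of the total with g = "
          f"{composite_g_saturation(0.6, n):.3f}")
```

The total approaches a pure measure of g as sub-tests are added, though with diminishing returns.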

Criticisms of the Two-factor Theory. — The theory has been
subjected to much criticism, both on psychological and on statistical
grounds. Spearman suggested that g might correspond to the con- 
ception of ‘general mental energy.’ Whether or not this is a psycho- 
logically sound explanation need not concern us here. On the 
statistical side, Thomson (1939) and others pointed out that, al- 
though the analysis into general and specific factors is legitimate 
when the tetrad difference criterion is satisfied, yet this is not the 
only possible mode of analysis; that if mental abilities consisted of 
large numbers of overlapping group factors, the same criterion 
would still be satisfied. Thomson’s view would seem to the present 
writer to be preferable from the psychological standpoint, but 
Spearman’s view is perhaps more practically convenient, since we do 
already in everyday life assume the existence of something like g — 
whether we call it ‘brains,’ ‘brightness,’ ‘intelligence,’ or what not — 
and the Two-factor Theory provides a scientific basis for this com- 
mon-sense assumption. 

The theory, and the experiments which it has stimulated, at one 
time seemed to discredit all the mental faculties which had so often 
been abused in traditional psychology and in common parlance. But 
more recent research with modern factorial methods indicates that 
the theory requires considerable modifications in the matter of 
group factors. A large variety of group, or multiple, factors has been 
described, particularly in American researches, and some of them do 
resemble the older faculties. Unfortunately, there are considerable
disagreements between different factorists as to which are the most
distinctive and fundamental, but the series that Thurstone established
is very widely accepted. Thurstone (1938) applied an extensive
battery of psychological tests to university students and found that they
resolved into factors which he identified as verbal ability (V),
number ability (N), word fluency (W), spatial judgment (S), rote
memory (M), perceptual speed (P), and deductive and inductive
reasoning (D, I). A fairly similar picture was obtained in studies of
children, even down to the 5- to 6-year level, and Thurstone has issued
sets of tests for measuring these factors at several age levels, namely
the P.M.A. or Primary Mental Abilities tests. Note that he does not 
allow for a g running through all these abilities, though Spearman 
and his followers have shown that a pattern of general + group 
factors fits the same results as well as, or better than, a pattern of 
independent multiple factors.¹

It may safely be concluded that no test measures nothing but g and
a specific factor, since the type of test material employed always
introduces some additional common element. All verbal intelligence
tests, together with other tests depending on manipulations of words,
involve a verbal and educational factor, which has been named v or
v:ed. Those based on multiple-choice items, to be done at speed,
embody a further factor or factors, not present in unspeeded,
creative-response tests such as Terman-Merrill. In order to escape
the influence of v:ed, many mental tests have been constructed from
abstract diagrams or pictures. Such tests, however, in so far as they
involve spatial imagination or the mental manipulation of shapes,
yield a spatial factor called k* (El Koussy, 1935), or S (Thurstone, 1938).
Certainly some, and possibly all, of the common non-verbal
performance tests bring in this same k, and various minor dexterity
factors in addition. Mechanical ability tests too measure k, together
with a mechanical information or experience factor, m. School
examinations and educational attainments tests depend (as Burt’s
research showed) not only on g and v:ed, but also on a factor which
represents the pupils’ interest in and attitudes to work, and their
character or temperament qualities. Following Alexander (1935),
this is often referred to as the X-factor. In addition, there are probably
group factors for families of school subjects such as the linguistic,
scientific and mathematical, practical and manual, and further
sub-factors largely governed, as we have seen, by teaching methods.

¹ A more detailed discussion of the pros and cons of the g + group-factor and
the multiple-factor approaches is given elsewhere (Vernon, 1950), and it is shown
that they can fairly readily be reconciled.

* Alexander (1935) claimed the existence of a distinctive practical factor, F,
in certain performance tests; but later work appears to indicate that it cannot be
differentiated from k.

Educational and Vocational Applications. — The findings of
factorial research have important educational and vocational
bearings. If the Two-factor Theory was exclusively accepted,
educational and vocational guidance by means of tests would scarcely be
possible. If we wished to advise a child as to the type of education
or the occupation for which he would be most suited, we could
apply tests of g and so deduce the general inlcllcclual lev:l of the 
work he should lake up. But we could not dccide^vvhether he would 
be likely to do belter at classics or science, at a clerical job or an 
engineering trade, and so on, except by giving a vast number of 
tests of different s factors, or by letting him try each type of work in 
turn so as to find by practical experience which spccihe one suited 
him. There could be no recommending of general lines of occupation. 
To a certain extent this does seem to be true. Educational and 
vocational guidance among adolescents are extremely dilhciili and 
uncertain. They have to be based more upon a subjective analysis of 
interests and character traits, background and relevant training, than 
upon objective predictions by tests. Yet common cxpciicnce strongly 
suggests that there do exist general categories or types of work, and 
that therefore a number of group factors, additional to /?/, could be 
established by further research. If this were done, guidance would 
become much more scientific. Full consideration ol' a candidate's 
interests and character would, of course, still be rcquiicd, but his 
scores on a series of tests for each of these factors would give 
objective indications of his piobable ability along each of the main 
lines of work.^ Much the same is true in clinical work with malad- 
justed children or neurotic or psychotic adults. Psychiatrists and 
clinical psychologists often claim, on the basis of poorly validated 
tests, that that their patients show deterioration or weakness in such 
faculties as memory, orientation, attention, \ isuali/alion, etc. Factor 
analysis indicates that most of these tests arc merelv rather un- 
reliable measures of .g, or g \ v : cd. But at the same time it can be 

^ Thomson (1939) has pointed out thtil wlien tests are vu\en with tlie ohjeet 
of predicting ediieational or vocational achie\cmenl, the e\li action ol faclois 
from the tests may cimstitiite quite an unnecessary intermediate step. A much 
more etheient method is to correlate the tests directly with measures of achieve- 
ment and then to use multiple coiielation for obtaining the most accurate pre- 
dictions. While this view is, ofcouise, perfectly true from the statistical stand- 
point, it would seem to the present w liter to be praciicall> applicable only in 
educational and vocational selection, not in guidance. It should certainly be 
used by the tester who wishes to select the best candidates for some single ana 
definite educational course or |ob. Hut the problems of the voc.itional ad\iser 
are much too complex to be tackled as yet by this exact method. He i> loiced to 
predict a candidate’s probable achievement in generali/ed teims, or factors. \t 
the present time he is doing so largely in terms of unverilied factor^, derived 
from subjective considerations of tests and lines of work. I he view put foiwaid 
above is that his procedure could be made more systematic and scientific if moie 
factors were experimentally established. 
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used to isolate definite group factors such as mental speed, and 
flexibility in the formation of concepts, which are especially apt to 
be affected by abnormal mental states; and it can show which tests 
are the most promising for measuring such factors. 
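
The direct method which the footnote attributes to Thomson, namely correlating the tests with a measure of achievement and then combining them by multiple correlation, can be illustrated for the simplest case of two tests. The sketch below uses the standard two-predictor formula; the three correlations are invented for illustration and are not taken from any study.

```python
import math

def multiple_r(r1c, r2c, r12):
    """Multiple correlation of a criterion with two tests.

    r1c, r2c: correlation of each test with the criterion.
    r12: correlation between the two tests.
    """
    r_squared = (r1c**2 + r2c**2 - 2 * r1c * r2c * r12) / (1 - r12**2)
    return math.sqrt(r_squared)

def beta_weights(r1c, r2c, r12):
    """Standardized regression weights for the two tests."""
    b1 = (r1c - r2c * r12) / (1 - r12**2)
    b2 = (r2c - r1c * r12) / (1 - r12**2)
    return b1, b2

# Hypothetical figures: test 1 correlates 0.50 with achievement,
# test 2 correlates 0.40, and the two tests correlate 0.30 together.
R = multiple_r(0.50, 0.40, 0.30)
b1, b2 = beta_weights(0.50, 0.40, 0.30)
print(round(R, 3), round(b1, 3), round(b2, 3))
```

With these figures the combined prediction is somewhat better than that of either test alone, and the weights show how much each test contributes once its overlap with the other is allowed for.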

While we have a fairly clear conception of the main factors under- 
lying the abilities of school-children and of unselected adults, the 
picture among older adolescents and adults of superior ability is 
much more complex and obscure. A g-factor can still be extracted 
from their mental test results, but it seems to play a decreasingly 
important part in educational or vocational achievement, and inter- 
ests, work attitudes, and temperament traits become more influential. 
An important series of researches by Guilford (1950-4) in America 
with high-grade groups such as air-force pilots and university 
students suggests that numerous intellectual factors can be dis- 
tinguished in the fields of reasoning, planning, judgment, originality, 
etc. ; and that it would be more profitable to develop tests for these 
than for a vaguely conceived general capacity of intelligence. Which 
of them will emerge as consistent and clear-cut, and as educationally 
or vocationally important, among other groups (e.g. British gram- 
mar school pupils and students) is still a matter of doubt. 

Conclusions about Factor Analysis . — A great deal of unavoidably 
complex and expensive research will be needed before we can hope 
to arrive at a reasonably complete map, as it were, of the whole 
realm of human abilities. The difficulties are enhanced because, as 
we have seen, the factor patterns which an investigator extracts are 
often dependent on the particular tests and particular testees he 
employs, and because there is much room for subjective judgment in 
the type of factor pattern he chooses. Research at present is rather 
badly co-ordinated, since each investigator sets out to explore fresh 
territory instead of trying to build on the factors already fairly well 
established by previous workers. Further, there is always the danger 
of mathematical rather than psychological considerations getting 
the upper hand. As already mentioned, factorial investigations are 
not revealing the elements out of which the mind is compounded, 
but are classifying test performances with a view to predicting more 
effectively other performances, e.g. in the educational or vocational 
sphere. But by no means all of the important facets of human 
nature are susceptible yet of accurate measurement. There may be 
abilities, such as executive capacity, intuitive power, goodness of 
teaching, and social intelligence, which are generally recognized in 
daily life, but which are not yet properly substantiated or analysed 
owing to the inadequacies of our tests. Still more is this true in the 
realm of emotional traits and interests, for reasons which the writer 
has described elsewhere (Vernon, 1953). Yet we cannot hope to 
make successful predictions about an individual child or adult 
unless we know every side of his make-up. It may well be that all 
these attempts at analysing the measurable abilities and traits of 
human beings will never give us a picture of an individual personality 
considered as a whole, that we shall always need to supplement test 
results with subjective judgment and intuition in order to resynthesize 
them into a living entity. But this, too, is a matter of psychological 
controversy which the reader may follow up elsewhere. 

We have no space to discuss the actual statistical techniques of 
factor analysis. Spearman has fully described his method in The 
Abilities of Man (1927). It is not difficult to apply, though rather 
laborious. It is the most accurate method of extracting unitary- 
factor patterns, but it is hardly suitable for bi- and multiple-factor 
patterns. Holzinger’s (1941) method is an extension which deals 
readily with bi-factor patterns, though it does not seem to possess 
much advantage over the flexible group-factor techniques outlined 
by Burt (1950). The most widely used technique is Thurstone’s 
centroid method, which is well described by Guilford (1936), and 
is not difficult to acquire. But it has many more elaborate refine- 
ments, for which Thurstone’s book (1947), or Cattell (1952), should 
be consulted. Other more accurate techniques, whose aims and 
guiding principles are somewhat different from those described in 
the present chapter, have been developed by Hotelling, Kelley, and 
Burt. The clearest account of the whole subject is that of Thomson 
(1939). 
MENTAL TESTS 

MENTAL testing is simply a more refined and scientific method of 
doing something which we all of us do every day of our lives, that 
is assessing one another’s abilities and character traits. Our ordinary 
method is to observe a person’s behaviour, including his facial 
expressions and gestures, in various situations and circumstances, 
and his speech or written productions. For example, we might watch 
a small boy playing with bricks, and from his skilfulness, or the 
elaborateness of the resulting construction, jump to the conclusion 
that he is quite a bright child. Obviously the basis for this and other 
such judgments is very haphazard and unreliable. There is no sure 
evidence that the boy is acting more, or less, intelligently than the 
majority of children. Nevertheless this performance may quite well 
be refined and turned into a scientific mental test. So-called per- 
formance tests of intelligence are based on just such activities with 
bricks, blocks, and other concrete material. 

Let us then first consider the essential features of a mental test 
which distinguish it from such everyday judgments and from the 
slightly more scientific school examination. 

1. Standardization of Test Situation . — The first step which the 
tester takes is to standardize the test situation. He will, for example, 
adopt some standard set of bricks and give definite instructions, so 
that all children who are subjected to the test shall do it under as 
nearly as possible identical circumstances. Until this is done it is not 
legitimate to compare the performances of different children. With 
every good test there is published a detailed set of instructions, to 
which the tester must closely adhere. School and university ex- 
aminers, on the other hand, usually give no instructions beyond 
stating the number of questions to be answered, and they neglect to 
inform the examinees what type of answers they require. Because of 
this, and because of frequent ambiguities in the wording of questions, 
different examinees may conceive the tasks that are set them in a 
variety of different ways, and it becomes exceedingly difficult to 
compare and mark their answers. 
The material of a test, its conditions of application, and its in- 
structions are further planned to be readily comprehensible to, and 
suitable to the mental level of, the children or adults for whom it is 
intended. This is another feature which some school examiners 
might be advised to copy instead of setting subjects for composi- 
tions, or other questions, which are far above the heads of the 
pupils. 

We must, however, admit that several widely used tests do not 
entirely live up to these criteria. Some of them were originally con- 
structed and standardized in America; hence they require consider- 
able modification or translation before they are suitable for British 
children. We are forced to employ these ‘second-hand’ tests because 
the production of fresh tests is an extremely lengthy and expensive 
business. Unfortunately different British testers do the necessary 
‘re-modelling’ in different ways, so that the test situation is no longer 
identical for all. 

2. Standardization of Scoring . — The mental tester does not ob- 
serve the child’s behaviour, or his finished product, in the casual 
manner implied by our illustration, but obtains an accurate record 
of them. He insists on knowing definitely what was the test response, 
i.e. the child’s reaction to the previously laid down test situation. 
Either the response is something which can be delimited in quantita- 
tive terms (e.g. so many bricks put into their correct positions in so 
many seconds), or else it is something which can be designated un- 
equivocally as right or wrong.^ It is most important that the personal 
opinion of the tester should play no part in this record. Any two 
testers observing and scoring the response should arrive at identical 
records. If this condition holds the test is called an objective one. All 
good mental tests, then, are provided with scoring keys or manuals 
of instructions which enable the recording and scoring to be done 
objectively. 

Our everyday judgments of traits or abilities are, by contrast, 
decidedly subjective, for numerous investigations have demon- 
strated their liability to vary with the person who makes them. Some 
teachers are still convinced that they can assess the intelligence of 
their pupils better than can any test. But the fact is that different 
teachers, assessing the same pupils, give such different opinions that 
the scientific psychologist can put little trust in any of them. The 

^ So-called Quality Scales are tests whose scoring constitutes an exception to 
this rule, cf. p. 154. 

same is true of many school and other examination marks, as will 
appear in Chap. XI. 

3. Norms . — The layman commonly seems to regard the ob- 
served behaviour as having some absolute value or significance. 
There is no doubt in his mind that it indicates high intelligence, or 
poor sociability, or whatever the trait may be. The more cautious 
psychologist realizes that such judgments must be based largely on 
more or less vague recollections of the behaviour of other similar 
children under similar circumstances, and so insists that all grading 
of abilities is essentially a comparative or relative matter (cf. Chaps. 
II and IV). Thus one of the most vital accompaniments of a mental 
test is the norms, by reference to which any child’s goodness or 
poorness of performance may be assessed. Unfortunately it is also 
the feature in which many published tests are defective, on account 
of the difficulties already described (pp. 66-73). 

In the nature of examining it is hardly possible to establish norms 
for examinations, since they cannot be applied to large numbers of 
pupils or students beforehand for this purpose, as is the standardized 
test. We have seen, however, that the process of rationalizing marks, 
which is essentially equivalent to the setting up of provisional 
norms, is necessary if the present ambiguities in the significance of 
school or examination marks are to be avoided. 

4. Reliability and Validity . — A single piece of behaviour, 
casually observed, as in our illustration, would probably be very low 
in reliability. A discussion of the statistical methods of determining 
reliability was given above (p. 126), and the reliability of several of 
the more important tests is mentioned later in this chapter. 

The most serious defect in the layman’s everyday judgments is that 
he seldom has any good proof that the observed behaviour validly 
indicates the ability which he deduces from it. It may have arisen 
from some quite different trait or ability, or else from an ability 
which is almost wholly specific (in Spearman’s sense), i.e. not corre- 
lated with anything else. To be reliable a test need only measure 
accurately the ability to perform that test, but to be valid it must also 
correlate effectively with other measures of whatever it is supposed 
to test. The mere inspection of the content of a test is seldom sufficient 
for the determination of its validity, although it is a necessary 
preliminary. 
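
The distinction may be put in numerical form. In the sketch below, which rests on invented scores for six testees, reliability is taken as the correlation between two applications of the same test, and validity as the correlation between the test and an outside measure of the ability in question. The variable names and figures are assumptions made purely for illustration.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of six testees.
first_testing = [12, 15, 9, 20, 17, 11]
second_testing = [13, 14, 10, 19, 18, 10]   # same test given again
criterion = [3, 5, 4, 4, 2, 5]              # outside measure of the ability

reliability = pearson(first_testing, second_testing)
validity = pearson(first_testing, criterion)
print(round(reliability, 2), round(validity, 2))
```

In this made-up instance the test agrees closely with itself and yet bears almost no relation to the criterion: it is reliable without being valid.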

Such inspection of educational tests and examinations may reveal 
their failure to cover the ground which the pupils should have 
accomplished. Thus school and university examinations are often de- 
fective because their questions are too few in number, or are badly set. 
It will be obvious also to a tester that the content of some of the pub- 
lished educational tests is unsatisfactory, either because they are 
designed for pupils whose tuition differs widely from that current 
among his testees, or because they are out-of-date. But apart from 
such flaws, many tests involve group or specific factors which detract 
from their effectiveness as measures of some general ability. For 
example, a test of reading may be based on the capacity for pro- 
nouncing difficult words. This may or may not indicate children’s 
capacities in other aspects of reading, such as their fluency, or their 
comprehension of what they have read. Investigation by means of 
correlations is essential for establishing these points. Again, the 
technique of the test, i.e. the way in which the testee is required to 
express his ability, always seems to introduce an irrelevant factor. 
In Chap. XI we shall show that in an ordinary essay-type of exam- 
ination, the translation of the examinees’ knowledge into essay-form 
is one such factor, and in Chap. XII the examiner’s translation of his 
questions into new-type or objective form will be recognized as an- 
other. Temperamental factors, health, fatigue, and the like also 
influence a testee’s performances at many tests, and so reduce their 
validity as measures of ability. 

Most educational and vocational tests are assumed not only to 
be valid indicators of some present general ability, but also to be 
prognostic of future achievement. Their predictions should then, of 
course, be followed up and compared with subsequently obtained 
measures of achievement. In the educational field there have been 
many such investigations whose results usually show our tests and 
examinations to possess disappointingly poor validity (cf. Chap. XI). 
Vocational psychologists have also tried to correlate their predictive 
tests with some measure of output, or with estimates of efficiency 
obtained from supervisors or foremen. As shown elsewhere (Vernon 
and Parry, 1949), however, it is not always easy to secure a satisfac- 
tory criterion of success or failure in an occupation against which 
the test results can be checked. 

The validation of intelligence tests, or of tests of capacities such as 
practical ability, verbal ability, and the like, is particularly difficult, 
since we possess no objective criterion of intelligence, etc., with 
which to compare them. Ratings have sometimes been used, that is, 
teachers’ estimates of the intelligence of children in their charge, but 
these are highly fallible. If psychologists are unable to agree as to 
what they mean by intelligence, it is unlikely that laymen’s views will 
be any more concordant. Indeed, if teachers could judge it accurately, 
there would be no need for tests. The main defect of such ratings is 
that they are strongly biased (often quite unwittingly) by considera- 
tions such as the children’s industriousness and good school be- 
haviour. Parents’ judgments are still less trustworthy. Several group 
intelligence tests have been validated by comparing their results with 
the results of one of the Binet scales. But the validity of these scales 
is open to doubt. The items which they contain have always been 
chosen so as to differentiate older (hence presumably more intelligent) 
children from younger (less intelligent) ones. This criterion is, how- 
ever, dubious in value, since it would admit tests of, say, height and 
weight, which have very little validity as indicators of intelligence. Yet 
we do not rely merely on subjective judgment when we state that these 
must be excluded from a test of intelligence, for in the absence of 
any external criterion of validity, we make use of what has been called 
the method of internal consistency. 

This method consists either of factorial analysis, or of a modifica- 
tion thereof. All the test items or sub-tests are selected in the first 
place on the basis of subjective opinion ; they must appear to the 
tester to involve the exercise of intellect. Thereafter they are studied 
empirically. Either each item is compared with every other, or with 
the sum of all the rest of the items, or sets of items such as the sub- 
tests in a group intelligence test are compared with other sets. The 
inter-correlations will serve to show whether or not all the items are 
consistent with one another — whether they are measuring one and 
the same thing. And items that fail to conform to this criterion are 
excluded from the final, published, version of the test. The statistical 
conditions imposed by Spearman in his factorial analysis of group 
intelligence tests are much more rigorous than those used by Terman 
in revising the Binet scale. Both, however, prove that there is a gen- 
eral factor running through all their test items or sub-tests. Spearman 
and his followers wish to measure nothing but g, that is, they aim at 
a uni-factor pattern, and so exclude a number of sub-tests which fail 
to conform to such a pattern. The Binet test, on the other hand, al- 
though it certainly embodies a g-factor, contains a hotch-potch of 
heterogeneous tasks which would, if fully analysed, yield several 
group factors (a bi-factor pattern), or even a set of multiple factors 
(cf. Burt, 1939b). The effects of these different techniques of test 
construction will be discussed more fully below. What matters for 
our present purpose is that when a test, or a set of tests, is internally 
consistent, this signifies that it will validly predict any other be- 
haviour which involves the same general factor. Intelligence tests 
work fairly effectively because scholastic success, and many of the 
other intellectual activities of everyday life, are known to be depen- 
dent largely on the g which they measure. 
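
One simple form of the internal-consistency check described above, that in which each item is compared with the sum of all the rest, may be sketched as follows. The responses of six testees to four items are invented, the cut-off of 0.2 is an arbitrary assumption, and the sketch makes a single pass only; it is not offered as the procedure of Spearman or of Terman.

```python
import math

def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either list has no spread."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def item_rest_correlations(responses):
    """Correlate each item with the total of the remaining items.

    responses: one row per testee, each row a list of item scores (1 = right).
    """
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        result.append(pearson(item, rest))
    return result

def consistent_items(responses, threshold=0.2):
    """Indices of items whose item-rest correlation reaches the cut-off."""
    correlations = item_rest_correlations(responses)
    return [j for j, r in enumerate(correlations) if r >= threshold]

# Hypothetical responses: the fourth item runs against the other three.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
kept = consistent_items(responses)
print(kept)
```

Here the fourth item correlates negatively with the rest and would be excluded, while the first three hang together and would be retained.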

Test Construction. — Before proceeding, we would strongly urge 
amateur testers and teachers not to attempt to make up their own 
tests, apart from new-type ones for school examination purposes. To 
devise test items similar to those which appear in published tests is 
not difficult, but a great many technical features enter into the selec- 
tion of suitable items and into their standardization, in addition to 
those described here. The points which we have outlined are meant, 
not to show how to construct a test, but to enable amateurs to choose 
more wisely the tests which will be most suitable for their purposes, 
and to give them some assistance in understanding what they are 
doing when they apply published tests. 

TYPES OF TESTS 

At the end of this chapter is given a classified list of most of the 
mental tests available in this country. The main headings are: 

I. Attainment, Achievement, or Educational Tests 

A. Examinations. 

B. Standardized Educational Tests. 

1. Individual reading tests. 

2. Language development and Vocabulary. 

3. Individual oral arithmetic. 

4. Group tests. 

II. Individual Intelligence Tests. 

A. Versions of the Binet-Simon scale. 

B. Wechsler scales. 

C. Miscellaneous scales. 

D-E. Performance Tests. 

III. Group Intelligence Tests. 

A. Oral. 

B. Non-verbal. 

C. Verbal. 
IV. Aptitude and Special Ability Tests. 

V. Personality Tests and Assessments. 

The last two sections merely mention some of the tests commonly 
applied in this country, but the first three aim to be comprehensive. 
Although we shall not attempt to describe particular tests, the follow- 
ing general explanations of, and comments on, the main types may be 
useful to those who are not already thoroughly familiar with the field. 

Examinations and Educational Tests . — Examinations are still the 
most important method of educational measurement, and so come 
at the head of our list. Their defects, and the possibilities of improv- 
ing them, or of employing substitutes, are discussed in Chaps. XI- 
XIV. Standardized educational tests must be contrasted with them, 
for although they are made up of the same kinds of items or questions 
as occur in objective or new-type examinations, their functions are 
quite different. They are not designed to cover the work done by 
particular classes of pupils or students. Nor are they meant, pri- 
marily, for selecting pupils who are most likely to do well in some 
future work, though they are nowadays quite commonly used for 
assessing attainments at the 11+ selection stage. Rather they serve 
to show what is the standing of a class, or of an individual pupil, 
relative to other similar pupils, either on some general subject such 
as English or arithmetic, or on special branches of, or stages within, 
a subject (these latter being known as diagnostic,^ or analytic, tests). 
The fact that they are published, and their expense, obviously make 
them unsuitable for use as school examinations. By means of them, 
however, a teacher can usefully grade his pupils at the beginning of 
the school year, so as to find which ones will need most coaching, or 
in which parts of a subject they are most advanced or backward. If 
the norms are adequate (cf. Chap. IV), they will tell him how his 
pupils compare with pupils in general of the same age. The psycholo- 
gist at a Child Guidance Clinic, or the vocational adviser, finds them 
especially valuable for assessing a child's strong and weak points. 

Quality Scales . — Although most educational tests consist of large 
numbers of short items, both so as to cover the subject as thoroughly 
as possible, and so as to admit of objective scoring (cf. Chap. XII), 
some educational products which are too complex to be assessed in 
this piecemeal fashion can nevertheless be measured by ‘quality’ or 

^ Cf. the excellent discussion of diagnostic tests by Schonell (1950); also 
Whilde (1955). 
‘product’ scales. Such standardized scales are available for grading 
handwriting, composition, drawing, etc. Each of these consists of a 
series of specimen products which have been carefully chosen as 
representative of certain levels of achievement. The testee is told to 
write or draw the same matter in his normal manner, and his product 
is, as it were, slid along the scale, or compared with the specimens, 
until one is found to which it corresponds in achievement. It is then 
given the score of this specimen. Such grading cannot be made 
wholly objective, but there is evidence that it improves with practice. 
Different testers will, when experienced, award the same or nearly 
the same score to a child’s product. Certainly the process is fairer 
than grading without any external standards. Hence the adoption of a 
similar plan in the marking of school work was advocated in Chap. II. 

Burt’s scales all consist of specimens selected from a large number 
as being typical of the work of average 5-, 6-, . . . 14-year-old child- 
ren. Schonell’s composition scales, covering the 7- to 14-year range, 
are similar. Cattell’s handwriting scale for school-leavers and his 
drawing scale for adults are merely graded I to V, representing 
different degrees of merit at these age levels. 

Quantitative Tests and Scales . — The scoring of quantitative tests 
either by the ‘rate’ or ‘power’ systems, or by a mixture of these, has 
already been described in Chap. II. It will be noted that scarcely any 
tests are available much above the 14-year level, and that very few 
deal with subjects outside the 3 R’s. One obvious reason for this is 
that secondary school curricula vary so widely, and that only rela- 
tively small numbers study a foreign language or a science, etc., by 
comparable methods at comparable levels. Nevertheless it would not 
be difficult to devise and standardize tests in a number of subjects 
suitable for some fairly homogeneous groups of pupils or students, 
such as those in modern or in grammar schools, in teacher training 
colleges, or provincial universities. Certainly our tests lag far behind 
those available for similar purposes in America. 

INTELLIGENCE TESTS 

Intelligence tests closely resemble quantitative educational tests 
(to which they are, of course, historically prior) in their construction 
and scoring. Thus they may consist of: 

(a) A series of tasks graded in difficulty (power plan). 

(b) Tasks such as performance tests where the speed or number 
of moves are recorded (rate plan). 
(c) Sets of questions which are roughly graded, but where the 
score depends on the number answered within a certain time limit. 

Group tests generally include a ‘battery’ of some half-dozen (be- 
tween about four and twelve) ‘sub-tests,’ each containing some twenty 
(between ten and fifty) questions or items. Alternatively they are 
arranged in ‘omnibus’ form, where items of all kinds are mixed up. 
In the battery type, each sub-test has its own instructions and sample 
items, and is timed separately. In the omnibus form, the instructions 
and samples may be printed along with the questions, or included 
in a preliminary practice sheet, and one time limit suffices for the 
whole test. An omnibus test is therefore somewhat easier to give, but 
a test battery has the advantage of providing rest pauses every few 
minutes. 

Almost all intelligence tests thus consist of rather large numbers 
of items. Performance tests are exceptional, but it is desirable to 
apply several, say six or more, of these. The tests are always applied 
and scored in a standard manner, and the scores either take the 
form of Mental Age years (as in the Binet and some other tests), or 
percentiles or standard score I.Q.s (cf. p. 67). 
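
The conversion of a raw score into a percentile or a standard score I.Q. may be illustrated by the following sketch. It assumes a norm group whose raw scores are known, and a scale with a mean of 100 and a standard deviation of 15; the five norm scores are invented, and the function names are hypothetical rather than drawn from any published test manual.

```python
import statistics

def standard_score_iq(raw, norm_scores, mean_iq=100.0, sd_iq=15.0):
    """Express a raw score as a standard score on the assumed I.Q. scale."""
    norm_mean = statistics.mean(norm_scores)
    norm_sd = statistics.pstdev(norm_scores)
    return mean_iq + sd_iq * (raw - norm_mean) / norm_sd

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group falling below the raw score,
    counting half of those who tie with it."""
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical norm group (a real one would be far larger).
norm_scores = [20, 25, 30, 35, 40]

print(standard_score_iq(30, norm_scores))
print(percentile_rank(30, norm_scores))
print(round(standard_score_iq(40, norm_scores), 1))
```

Either figure has meaning only in relation to the norm group: the same raw score would yield a different percentile or I.Q. if the norms were drawn from a brighter or a duller population.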

The Innateness of Intelligence . — Many of the standard textbooks 
describe intelligence as a general innate capacity underlying all our 
abilities, dependent on the genes that we inherit, and therefore fairly 
constant throughout life. Thus intelligence tests differ, it is said, 
from attainments tests in that they consist of tasks whose solution 
depends upon native ability rather than upon acquired experience 
or upbringing. This view, however, seems rather unrealistic. For 
although there is ample evidence of the existence of inherited 
differences in intellectual potentiality, we are quite unable to observe 
or measure such a capacity. The good or poor intelligence which we 
observe in everyday life, at school or at work, and which we try to 
measure by our tests, is the product of innate factors and environ- 
ment. As Hebb (1948) and Piaget (1950) have shown, the young 
child gradually builds up or acquires his perceptions and habits and 
his thinking through contacts with his physical and social environ- 
ment ; and the level that he reaches at any age is affected considerably 
by the kind and amount of intellectual stimulation that the environ- 
ment has provided up to that age. This level is fairly stable under 
normal conditions of upbringing — more so than would be gathered 
from the publications of some left-wing theorists and of some 
extremely environmentalistic American investigators. But it does 
fluctuate more than would be expected from the statements of some 
of the early pioneers of mental testing. For example, over the prim- 
ary school age range from 6 to 11, or over the range from 11 to 18 
or so, the correlation between two similar tests is very far from per- 
fect, but it does not sink below +0.70. According to the Probable 
Error of estimate technique (p. 127), this means that the average or 
median child alters by about 7 I.Q. points up or down on retesting. 
Many are more stable than this, but a very few show much larger 
gains or losses of 30 or more points (4 x P.E.); and 17% are liable 
to alter 15 or more points either way.^ At the same time the stability 
is great enough to allow us to make useful educational or vocational 
predictions, for the majority of children, over several years. Par- 
ticularly striking evidence of this fact is provided by Terman and 
Oden’s (1947) follow-up of children with very high I.Q.s over 25 
years. A fuller discussion of these controversial questions, and of 
the psychological nature of intelligence, may be found elsewhere 
(Vernon, 1960). 
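
The figure of about 7 points quoted above can be checked by a short calculation. The sketch assumes a standard deviation of 15 I.Q. points, a retest correlation of 0.70, and normally distributed errors of estimate; the probable error is 0.6745 times the standard error of estimate. Under these assumptions the proportion altering by 15 or more points comes out at roughly 16 per cent, which is close to, though not identical with, the 17 per cent given in the text.

```python
import math

def standard_error_of_estimate(sd, r):
    """Spread of retest scores about the value predicted from the first test."""
    return sd * math.sqrt(1 - r**2)

def probable_error_of_estimate(sd, r):
    """Half of all cases fall within this distance of the predicted score."""
    return 0.6745 * standard_error_of_estimate(sd, r)

def share_changing_by(points, sd, r):
    """Proportion expected to differ from prediction by at least `points`,
    in either direction, assuming normally distributed errors."""
    z = points / standard_error_of_estimate(sd, r)
    cumulative = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - cumulative)

pe = probable_error_of_estimate(15, 0.70)
share = share_changing_by(15, 15, 0.70)
print(round(pe, 1), round(4 * pe, 1), round(100 * share))
```

Four times the probable error is just under 30 points, which agrees with the statement that only a very few children show changes of that size.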

Thus a better way of looking at intelligence and attainments is to 
say that the former refers to the more general qualities of thinking — 
comprehension, level of concept development, reasoning and grasp- 
ing relations — qualities which seem to be acquired largely in the 
course of normal development without specific tuition ; whereas the 
latter refer more to knowledge and skills which are directly trained, 
and whose acquisition and retention depend more on the person's 
interests and on his personality traits such as industriousness. There 
is no sharp dividing line between them. Indeed some items such as 
vocabulary are to be found either in intelligence or in English attain- 
ment tests. Again some of the items in the Binet and Wechsler 
scales, and in such well-known group tests as Army Alpha, are 
based on general knowledge or information. Although such know- 
ledge is obviously acquired, these work very well as measures of 
intelligence, since the more intelligent are more likely to pick it up, 
in or out of school, and to retain it.^ I.Q. and E.Q. always show very 

^ With repeated retestings variations are greater, and they may be increased 
by numerous factors such as above average ability, standard deviations greater 
than 15, inaccurate norms, or practice and coaching effects. They are much 
larger, also, if the first test is given before the age of 6. Indeed so-called develop- 
mental quotients obtained from tests of 0 to 2½-year-olds have virtually no 
predictive value for later I.Q. 

2 A more serious criticism is that girls and women tend to be considerably 
poorer in general knowledge than boys and men ; and it is preferable to use items 
for intelligence tests which show no large sex differences. 
close overlapping, though some children do come out markedly 
better on one than the other ; and investigations such as Burt’s (1937) 
and Schonell’s (1942) into educational backwardness reveal that 
those with relatively low attainment are usually handicapped by 
bad home conditions, poor health, ineffective teaching, or emotional 
maladjustment, or a combination of these. Thus even though we must 
admit that intelligence and the I.Q. are only partly innately deter- 
mined, it is still true that they give better indications of capacity for 
acquiring further education than do present attainments alone. 
Intelligence tests are more fair to children from remote rural 
schools, and others whose schooling has been affected, e.g. by 
absences, and to children of inferior background; although all these 
factors themselves do have adverse effects on the I.Q. 

The majority of tests employ the verbal medium since man’s 
higher thinking and reasoning capacities generally involve the use 
of words, numbers, or other symbols. Moreover one of the most 
important, if not the most important, use of intelligence tests is that 
of predicting scholastic aptitude, and as school work is itself pre- 
dominantly verbal, verbal tests are certainly the most useful. But 
this means that people’s performance at such tests is considerably 
affected if they have had inadequate opportunities for acquiring 
language, for example the foreign-born or bilingual, the deaf, or 
children of gipsies or canal bargees who receive little or no schooling. 
Many testers therefore prefer non-verbal tests, consisting of prob- 
lems based on pictures, abstract diagrams or concrete (performance) 
material. While these may be more fair to the verbally handicapped, 
they cannot be considered as giving a truer measure of innate 
ability. A British or American child, for example, has much more 
opportunity than an African child for acquiring skills with pictures 
and blocks, and a middle-class British child probably has advantages 
over a lower working-class child. Nadel (1939) and Biesheuvel (1949) 
show convincingly the fallacies of assuming that racial differences in 
intelligence can be measured by such tests, or indeed that any type 
of test is ‘culture free.’ Tests of the concrete or pictorial type are 
valuable not only for the deaf and the illiterate, but also for younger 
children who are less interested in verbal tests and less accustomed 
to verbal tasks than older persons. But these, and other non-verbal, 
tests give decidedly poorer predictions of educational capacity, 
except possibly in the technical, scientific, and mathematical fields. 

Individual Tests. — Up to 1937, the majority of testers of individual 
children used the Stanford Revision of the Binet-Simon scale, pub- 
lished by Terman in 1916, or Burt’s unpublished restandardization 
of this Revision. The New Stanford Revision, or Terman-Merrill 
scale then became the most popular version (Terman and Merrill, 
1937). This is superior both because it contains very full instructions 
for application and scoring, which help to obviate the confusions 
and errors previously so common among inexperienced testers, and 
because of its greater extensiveness. It is issued in two forms, L and 
M, both of which cover almost the whole range of intelligence from 
that of 1^-year children to that of adults. At the same time it has its 
difficulties and weaknesses: 

(a) The original American versions cannot be applied as they 
stand in this country, because they are replete with words and phrases 
that require ‘translation’ into English. Some of these have been 
altered in the edition published here, but many other desirable 
modifications have been suggested. Of the latter, the simplest and 
the most accessible are those prepared by the Scottish Council for 
Research in Education (Kennedy-Fraser, 1945). Again the typical 
responses quoted in the scoring instructions are of little use since, 
even after translation, they represent the thoughts and expressions 
of American children. The tester must try to decide whether his 
testees’ responses correspond in intellectual level to these accept- 
able and unacceptable responses. 

(b) None of the British versions has been restandardized. Thus 
the tester must expect to find that many of the tests are misplaced in 
order of difficulty. Children may fail all the tests in one year, and 
yet pass one or two in a higher year. The same applies to items within 
a test. Thus the Vocabulary words and some of the Absurdities, 
etc., require rearrangement. 

(c) It is difficult to obtain toys and other practical material, 
needed for tests of younger children, which are precisely comparable 
to the originals, and this material is very expensive. Fortunately it is 
used only at Mental Ages of 4 and less. For almost all children of 
primary school age, the printed cards and beads and blocks are 
sufficient. 

(d) Most of the tests near the top end of the scale, labelled Aver- 
age or Superior Adult (M.A.s 15-22), are too childish in content to 
be suitable for adults, although they are useful for discriminating 
among children of above average and very superior intelligence. 

(e) Despite the care taken over the American standardization, the 
tests from about 11 years onwards appear to be considerably too 
easy, and testers complain that far too many adolescents are getting 
high I.Q.s in the 130-160 range. It may be that these queer results 
arise merely from variations in the Standard Deviation of I.Q.s at 
different age levels (cf. p. 71); and Fraser Roberts and Mellone 
(1952) supply a useful correction table which allows for such varia- 
tions. But, as pointed out earlier (p. 73), no test based on M.A. 
scoring can be expected to yield really satisfactory results. 

The Wechsler intelligence scales surmount several of these 
difficulties. The original Wechsler-Bellevue test (Wechsler, 1939) is 
now replaced by W.A.I.S. (Adult Intelligence Scale),¹ which has 
norms from 17 to 75 years. The W.I.S.C. (Intelligence Scale for 
Children) is applicable from 5 to 15+ years, but is not — like 
Terman-Merrill — suitable for pre-school children. Each scale in- 
cludes 5 or 6 verbal sub-tests and 5 performance tests, and thus yields 
a verbal and a performance, as well as a combined, I.Q. Here too 
some adaptations are needed for British testees, but standard modi- 
fications have been prepared by the British Psychological Society. 
The pictures, blocks, etc., needed for the performance tests are un- 
avoidably expensive, but the standardization and scoring are so much 
superior to those of any other scale that most psychologists con- 
cerned with individual testing of children and, especially, of adults 
prefer the Wechsler. 

Valentine’s Intelligence Tests for Young Children (1948) are 
particularly suitable for 2- to 5-year-olds, though also providing a 
short battery of individual tests for children up to about 11 years. 
This scale has the advantage that all the necessary materials are 
included in the handbook, or can easily be constructed by the tester. 
For 0-2-year-olds, Griffiths’ (1954) Mental Development Scale is the 
most thorough, and the best standardized in Britain. It yields 
separate quotients in each of five fields of development — Locomotor, 
Personal-social, Hearing-speech, Eye and Hand, Performance, and 
a combined or general quotient. In testing at this level it is especially 
important that the tester be thoroughly trained and experienced in 
managing babies and using the correct materials and instructions. 

¹ The original version may, however, continue to be used for some time in this 
country, partly because more thorough English adaptations have been worked 
out, and partly because it normally takes only 45 minutes to apply as compared 
with over 60 for W.A.I.S. 

Performance Tests. — As we have seen, these practical tests may be 
used when verbal ones are inapplicable, or they may usefully supple- 
ment the predominantly verbal Binet-type test. Their chief merits are 
their greater attractiveness to children, and their capacity for bring- 
ing out a variety of temperamental reactions. A testee’s manner of 
approach to such tests, and his behaviour when confronted with 
difficulties, throws much light on his impulsiveness, persistence, 
complacency, and other qualities, which cannot as yet readily be 
tested or measured objectively. 

The majority of performance tests suffer from the practical draw- 
backs that they are very unwieldy and difficult to transport, and 
extremely costly. Some testers are suspicious of them because 
different tests often give very inconsistent results. A child with a 
Binet M.A. of 9 years may, for instance, pass performance tests at 
anywhere from the 7- to the 12-year levels. This is partly due to poor 
standardization, and to variations in the actual materials and 
methods of application. As they have to be applied individually, the 
task of providing accurate British age norms is a difficult one (cf. 
Vernon, 1937). Even if they were perfectly standardized, we should 
still expect considerable inconsistencies, since their inter-correlations 
show that each test involves a large specific factor, or unknown 
group factors. Many of them are also poor in reliability, and the 
evaluation of this feature is difficult because of the large practice 
effects if they are applied twice to the same testees. Thus the mental 
tester should never place much reliance on a single performance test, 
nor on two or three. A small number may be sufficient to indicate 
whether a Binet I.Q. is likely to have been distorted by some verbal 
defect, but the median or mean score from at least half a dozen, or 
the total result from a battery such as Drever-Collins, W.A.I.S. or 
W.I.S.C., should be taken if a reliable performance test M.A. or I.Q. 
is desired. 

Undoubtedly then their validity, either as measures of g or of 
‘practical ability,’ is poor. True, we have seen that they may involve 
the spatial ability factor, k, to some extent, and so be useful in 
selection for mechanical occupations and technical courses. But 
there is no real justification for the view that they give a better 
indication of intelligence for daily life purposes than the more 
reliable verbal type of test. It is interesting to note that several of 
them are markedly affected by emotional maladjustment and neuro- 
ticism. Children tested at a Psychological Clinic tend to make poorer 
scores than on Binet. Delinquents, on the other hand, who are often 
very backward educationally, do tend to do comparatively well on 
performance tests (cf. Earl, 1939). 

GROUP TESTS AND INDIVIDUAL TESTS 

The merits and defects both of individual tests such as Terman- 
Merrill, Wechsler and performance tests, and of group verbal or 
non-verbal tests, may best be brought out by contrasting them under 
the following series of headings: 

1. Individual tests must be used with very young 
children because it is practically impossible to hold the attention of 
a group, or to get them all to act alike at the same moment. Further, 
it is not natural for children to apply their intelligence to symbols on 
paper — either words, pictures or diagrams — until they have settled 
down to formal schooling. Manipulation of concrete objects or 
individual conversation with the tester constitute much more appro- 
priate testing media. Group tests are available which claim to 
measure Mental Ages down to 6 or even 5 years. The instructions for 
these must, of course, be given orally. But their value is very doubtful, 
for the above reasons. They may sometimes be useful for classifying 
pupils aged 7+, i.e. children whose M.A.s are likely to run from 
5 to 9. 

By 9 or, better, 10 years, children can read instructions for them- 
selves, are thoroughly accustomed to class work, and can readily 
deal with a variety of verbal problems. This is therefore a very suit- 
able age at which to begin group testing. Above 14, group tests have 
definite advantages over individual, both because they can readily 
be made difficult enough even for the highest intelligence levels, and 
because their items are much less apt to appear silly and childish to 
suspicious adolescents or adults. Nevertheless, the W.A.I.S. or 
W.I.S.C. are preferable for abnormal cases who would not settle 
down readily to the group test situation, e.g. the feeble-minded or 
patients in a mental hospital or clinic. 

2. Ease of Application. Group tests are easier to procure, to 
apply and to score, and they demand less training and experience on 
the part of the tester. Possibly, however, these are disadvantages. 
We assume too readily that group test performance is unaffected by 
the tester and the atmosphere he creates. As shown in the next 
chapter, there are many errors of application and interpretation that 
inexperienced testers often commit. 

3. Time. — Needless to say, the time saved by applying group 
instead of individual tests is enormous. For instance, a class of forty 
can usually be tested in less than an hour, and their answers scored 
in four to eight hours, whereas an experienced Binet tester needs 
about half an hour for each young child, three-quarters for an older 
one, and an inexperienced tester much longer still. The point, how- 
ever, is whether this period per child is worthwhile. The majority of 
psychologists would certainly say that it is, and that an individual 
test can reveal much better than a group test things which the 
child’s teachers and parents may have failed to recognize even after 
years of acquaintanceship. Probably the best compromise would be 
for a school psychologist or infants’ mistress to test each child 
individually at 6, for a group non-verbal or oral test to be given in 
class at 8, and verbal group tests at 10 and 14 (assuming that the 
results of selection tests are available at 11+). Such results would be 
entered on a cumulative record card, and would provide a very 
reliable picture of each child’s intellectual development. Individual 
tests would need to be given from 7 onwards only to exceptional 
pupils whose group test results were highly irregular, or about whom 
some important decision had to be made for purposes of educational 
or vocational guidance. 

4. Time Limits. — Most of the items in the Binet scale and many 
individual performance or educational tests are untimed, whereas 
the great majority of group tests have to be done under speeded 
conditions. Are not, the critics ask, some children slower but surer 
than those who score well on these timed tests? As indicated pre- 
viously (p. 132), the answer is more negative than positive, but the 
question is a complex one. If time allowances on speeded tests are 
increased, slower children certainly improve their scores, but they 
still generally remain poorer than quick ones. The rank order is but 
little affected and (according to the principles of Chaps. II and IV) 
it is relative, not absolute, scores that matter. At the same time experi- 
ments do show that with highly speeded and rather easy tests, 
children’s performance is considerably affected by their quickness 
and by whether they aim primarily at speed or at accuracy; whereas 
with very difficult, untimed material their persistence or willingness 
to go on trying may affect their results as much as does their ability. 
For most purposes it is desirable to compromise between these 
extremes, and to avoid mixing up our measures of ability with what 
are essentially temperamental or work attitude factors. Further, it 
would be administratively most inconvenient to use untimed tests, 
which different testees would take very different times to complete. 
Hence it is possible that some of our present group intelligence and 
attainments tests (particularly in arithmetic) are unduly speeded. If 
we inclined towards tests with longer time limits and more difficult 
items, very few children would alter their relative scores appreciably, 
but it is likely that our predictions of future educational achieve- 
ment would be slightly improved. We know, too, that emphasis on 
speed unduly handicaps older persons, say 30 years upwards, hence 
ample time limits are preferable among adults. It may be that it also 
upsets emotionally excitable or nervous children, and this is one 
reason why Clinic psychologists prefer to use individual tests where 
the testee gives his answer at his own pace. 

5. Objectivity. — With good group tests the conditions of 
application and the scoring are much better standardized than with 
most individual tests. Very little is left to the personal decision of the 
tester, and the marking is, or should be, completely foolproof. In 
actual practice many group testers still manage to commit gross 
errors, some of which will be pointed out in the next chapter. And 
modern individual tests like the Terman-Merrill and Wechsler afford 
much less scope for subjective whims than did the older ones. 
Nevertheless individual testers often have to judge the correctness of 
doubtful responses, and may, without departing from the instructions 
in the handbook, alter the testing conditions in a variety of ways. Yet 
there seems to be no evidence of serious discrepancies between the 
results of trained testers, although untrained ones may make grave 
mistakes, even sometimes leading to the wrongful certification of 
mentally defective children or adults. Reliability coefficients over 
short periods are very high, though greater difficulties leading to 
lowered reliability occur in testing children below the age of 5, and 
among maladjusted children or adults (cf. Tulchin, 1934). The ex- 
perienced tester should, however, be able to recognize when his 
results are liable to be affected by extreme unco-operativeness or 
instability in the testee, and to treat them with due caution or to 
arrange for a re-test when desirable. 

Actually the rigid objectivity of the group test is distrusted by 
Clinic psychologists, since they know that the same physical situa- 
tion may mean very different things to different children. The 
flexibility of individual testing is to them a great advantage which, so 
far as is known, they do not abuse. 

6. Dependence on Health and Mood. — Both group and individual 
tests seem to be much less affected by these factors than might be 
anticipated. Experiments show little and often no difference between 
the scores of children when they are fatigued or ill, and when in good 
health, nor between those who are strongly motivated (e.g. by the 
promise of monetary rewards) and those who are merely encouraged 
in the ordinary way. Studies of the latter type sometimes show that 
testees who are making more effort may attempt more test items, but 
they also make more mistakes so that their scores are scarcely 
altered. There seems then to be little evidence for such statements as 
that the examination conditions of group testing make some testees 
unduly nervous, or that others are stimulated to do better by the 
competitive nature of the testing situation. All such conclusions, 
however, are based on average tendencies, which admit occasional 
individual exceptions, and Binet testing certainly indicates that a few 
testees are so resistant or timid that they fail to do themselves justice. 
More extreme differences in attitude to the test can undoubtedly 
distort the scores, and certain physical conditions such as bad eye- 
sight must have a like effect. No intelligence test possesses perfect 
reliability; and conditions such as these may be at least partly 
responsible. Variations in group test performance, as well as in 
individual, are known to be greater among maladjusted children. Yet 
at the same time there is no evidence that excitement or anxiety 
upsets the results of 11+ selection tests among more than a small 
minority. 

The great advantage of individual over group testing is that in the 
former the tester can generally discern the influence of abnormal 
conditions, whereas in the latter he has scarcely any control over, or 
insight into, them. It may be that while high scores on group tests are 
relatively stable and significant, low scores can arise from a variety 
of sources other than poor intelligence. However, a lot more careful 
research is needed in order to determine definitely the part played by 
such factors. 

7. Dependence on Practice or Coaching. — Most intelligence tests 
are published, hence anyone can procure them, coach testees on 
them, and ensure considerably increased scores. It is a most foolish 
thing to do, since a child who is made to appear unduly intelligent 
may also legitimately be expected to be capable of unduly difficult 
school work. But whenever examinations for secondary school 
entrance, or for other competitive purposes, include intelligence 
tests, teachers and parents who wish the children to do well will 
naturally be tempted. In individual testing it is fairly easy to recog- 
nize when children have been coached or practised, since their glib 
answers are very different from the answers of children who are 
puzzling out the tests for the first time. It is less easy to know what 
allowance to make. In group testing, on the other hand, one may 
suspect that scores are too high all round, though one can never be 
certain unless (as has actually occurred) a whole class answers 19 
test items out of 20 correctly, the twentieth being one where the 
teacher was mistaken. 

In order to prevent direct coaching, most Educational Authorities 
obtain tests which are freshly constructed and privately standardized 
every year by Moray House or other organizations. But copies of 
older, more or less parallel, versions are readily available, and there 
are large sales of booklets which contain test items very similar to 
those in the unpublished tests. Particularly at the 11+ stage, and in 
areas where the provision of grammar school places falls far below 
the demand, intensive coaching in schools, at home, or by outside 
tutors, is known to be widespread. Not only does this tend to distort 
the normal school curriculum (cf. p. 199), but also it produces severe 
strain and anxiety among many of the pupils. 

The topic has received considerable publicity, and numerous 
investigations have been carried out. These show that coaching and 
practice on intelligence test materials do make an appreciable 
difference at the selection borderline, though they are nothing like as 
effective as parents and many teachers imagine. Here we should 
distinguish practice in taking complete intelligence tests, without 
further instruction or knowledge of the right answers, from coaching, 
where items are explained and hints given. A single previous practice 
test raises the average I.Q. by about 5 points, and further practices 
yield diminishing returns up to a total of about 10 points. After 
some five tests there is no further improvement, and I.Q.s fluctuate 
irregularly or even begin to decline. When combined with coaching, 
the effects are somewhat larger, the average maximum rise being 12 
to 15 points. But these figures refer to children or adults who have 
never seen a group intelligence test before. When, as is usually the 
case nowadays, they are fairly familiar with tests, the effects may be 
only half as great. Moreover it has been shown that the effects are 
highly specific. Thus the kind of coaching which is given from pub- 
lished books of miscellaneous items, and which does not include the 
taking of a complete parallel test under examination conditions, is 
remarkably ineffective, producing barely a 5 points rise. Again, the 
maximum improvement occurs remarkably quickly; if practice or 
coaching amount to more than a few hours they seem to defeat their 
own ends. 
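
The practice figures quoted above can be put into a rough numerical sketch. The geometric model used here is an assumption made purely for illustration, and is not taken from the investigations themselves: a first gain of 5 points which halves with each further practice test approaches a ceiling of 10 points, and has very nearly reached it after five tests. The names `practice_gain` and `familiarity` are hypothetical.

```python
def practice_gain(n_tests, first_gain=5.0, ratio=0.5, familiarity=1.0):
    """Cumulative I.Q. gain after n_tests practice tests.

    Assumed model (illustrative only): each further practice test adds
    `ratio` times the gain produced by the one before it.  `familiarity`
    scales the whole effect; 0.5 mimics testees who are already used to
    tests, for whom the text says the effects may be only half as great.
    """
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return familiarity * sum(first_gain * ratio ** k for k in range(n_tests))


if __name__ == "__main__":
    # Columns: number of practice tests, gain for naive testees,
    # gain for test-sophisticated testees.
    for n in range(7):
        print(n, round(practice_gain(n), 2),
              round(practice_gain(n, familiarity=0.5), 2))
```

Other values of `first_gain` and `ratio` would fit the quoted figures about as well; the sketch is meant only to show how quickly diminishing returns set in.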

We have admitted earlier that good schooling and home back- 
ground do genuinely stimulate the development of intelligence. But 
it is clear, from the facts just stated, that intelligence cannot be 
‘taught’ in at all the same sense as school subjects are taught. Prac- 
tice and coaching merely provide a limited facility in dealing with the 
rather roundabout and artificial types of items that occur in most 
group tests, in understanding and following the instructions quickly, 
and in the concentration of effort. Thus they have distinctly less 
effect on the more natural types of items used in individual tests, 
where the testee supplies his own answer in his own words, than they 
do on most multiple-choice or recognition-type group test items. 

In areas where coaching is at all common, the only fair solution 
seems to be to allow a limited amount of practice and instruction on 
parallel tests in the schools. This will give uncoached children a 
reasonable familiarity with the sort of tests they are to take, and will 
generally bring them to the level of the more intensively coached. 
But it would be better still if tests were used primarily for guidance 
rather than for selection, since then the incentive to coach would be 
removed. At the same time, the mental tester can never afford to 
ignore practice or coaching; the amount and kind of previous ex- 
perience of tests that testees have had does make a difference. For 
example, in any experimental investigation involving the comparison 
of two or more tests, the scores on those taken later will be to some 
extent affected. The educational psychologist should realize that 
published test norms may be inapplicable to children who are 
unduly ‘test-sophisticated,’ and should usually avoid giving tests in 
any school at intervals of less than, say, two years. 

8. Validity. — The various versions of the Binet test, as already 
mentioned, measure a poorly analysed hotch-potch of abilities, and 
they have been severely criticized by Spearman, Cattell and others on 
account of the weaknesses of their statistical construction and 
theoretical foundations. Burt (193%) likewise points out the hetero- 
geneity of item content. Perhaps still more serious are the variations 
in content at different age levels: for example, the verbal bias in 
Terman-Merrill becomes much more pronounced at the top end. 
Thus the scale cannot be said to be measuring the same intelligence, 
or even the same pattern of abilities, throughout. A further conse- 
quence of this heterogeneity is that it tempts testers to make dubious 
subjective analyses of types of items, as when they note that a child 
does especially well on the more practical items, or badly on the 
memory items, and jump to the conclusion that he is a ‘practical’ 
child in daily life, or has poor ‘memory’ for school work. The 
experienced clinical psychologist can obtain many suggestive hints 
as to qualitative features of the child’s intellect and temperament 
from his responses to particular items, or groups of items, but they 
are only hints which require to be followed up in far greater detail 
before they can be regarded as of any diagnostic value. Qua test, it 
only yields the one general score, which can be regarded as rather a 
vague mixture of g, v:ed, and other factors. 

In contrast, the well-constructed group test is highly homo- 
geneous, each of its sub-tests or types of items involving much the 
same factor pattern. Spearman would have claimed that it was an 
almost pure measure of g, though we realize now that the content 
and form of the items, together with difficulty level and timing 
conditions, do introduce other factors. It seems paradoxical, then, 
that most psychologists prefer the relatively unscientific Terman- 
Merrill test when they are called on to make any decision involving 
the intelligence of an individual child. They regard it as their most 
valid mental measuring instrument, and as giving the best available 
index of a child’s general intelligence. That less trust can be put in 
group test scores is shown by the fact that different group tests 
often give somewhat discrepant results. If they are fairly similar in 
make-up the correlation between them (over a few days or weeks) is 
likely to be between 0.8 and 0.9 in a representative group. But with 
greater diversity, as between verbal and non-verbal tests, the correla- 
tion may sink to 0.7 or below, and discrepancies between a child’s 
two I.Q.s may reach 30 points or more, though the average or median 
difference should still not exceed about 7 points. At the same time, 
group verbal tests actually give at least as good predictions of future 
educational standing as do the Terman-Merrill or Wechsler, among 
normal children aged 10 upwards and adults. The superior validity 
of individual tests is thus partly illusory, though it probably does 
hold good for younger children, particularly in regard to intelligence 
outside the school. 
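
The connexion between the correlations quoted above and the size of the discrepancies between a child’s two I.Q.s can be illustrated by a short calculation. It rests on assumptions which the text does not itself state: that both tests yield normally distributed I.Q.s with the same Standard Deviation (taken here as 15 points) and that their joint distribution is bivariate normal. The function names are hypothetical. On these assumptions the median difference comes out at roughly 6 to 8 points for correlations of 0.8 to 0.7, which is of the same order as the figure given, and a difference of 30 points or more is rare but by no means impossible.

```python
import math


def discrepancy_stats(r, sd=15.0):
    """Spread of the difference between two I.Q.s correlating r.

    Assumes equal SDs and bivariate normality (illustrative only).
    Returns (SD of the difference, mean absolute difference,
    median absolute difference).
    """
    sd_diff = sd * math.sqrt(2.0 * (1.0 - r))
    mean_abs = sd_diff * math.sqrt(2.0 / math.pi)
    median_abs = 0.6745 * sd_diff  # 0.6745 = 75th percentile of N(0, 1)
    return sd_diff, mean_abs, median_abs


def prob_discrepancy_at_least(points, r, sd=15.0):
    """Proportion of testees whose two I.Q.s differ by `points` or more."""
    sd_diff, _, _ = discrepancy_stats(r, sd)
    return math.erfc(points / (sd_diff * math.sqrt(2.0)))


if __name__ == "__main__":
    for r in (0.9, 0.8, 0.7):
        sd_diff, mean_abs, median_abs = discrepancy_stats(r)
        print(r, round(sd_diff, 1), round(mean_abs, 1), round(median_abs, 1),
              round(prob_discrepancy_at_least(30, r), 4))
```

A different Standard Deviation, or tests whose Standard Deviations vary with age, would alter the figures in proportion.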

The probable explanation of this situation is that the intelligence 
which we recognize in daily life, in school or out of it, is not pure g, 
but a mixture of abilities. Binet and his followers chose items largely 
from children’s everyday behaviour and thinking, regardless of their 
factor content, and the very impurities that they included happen to 
overlap very well with the mixture we want. In the group test, how- 
ever, the items are mostly of a more restricted and artificial type, 
though they happen to coincide fairly closely with the kind of in- 
telligence required for passing examinations. The child is forced to 
express his abilities through a less natural medium, and this intro- 
duces such components as facility-in-doing-multiple-choice-items- 
at-speed, which are irrelevant to anything we want to predict. 

A further consideration, especially among younger children and 
among the more unstable or abnormal testees of any age, is that the 
group test yields only a numerical score, with none of the additional 
indications which are so useful in interpreting individual test re- 
sults. The Binet tester has much more effective control over the 
testee’s state of mind. He ensures favourable co-operation, and can 
judge, albeit subjectively, whether his responses are typical of the 
best that he can do. Despite the small effects of mood, health, etc., 
on group test scores, there are likely to be many uncontrolled factors 
which upset these scores in individual instances, and which partly 
account for the imperfect correlations between different tests. It is 
artificial, again, to separate off intelligence as a purely cognitive 
variable from the rest of the personality. Normally it always operates 
in a certain emotional and social context. The group tester tries to 
ignore this; but the individual tester, by observing the child’s man- 
ner of approach to difficulties (particularly those offered by per- 
formance tests), from the kind of test response and from general 
conversation, can build up a picture of the child’s whole personality, 
and see how that personality applies his available intelligence. Thus 
he obtains a much more complete and practically useful view than 
can be got from a more scientific measure of ability factors, though 
admittedly this view is liable to subjective distortions. 

The present writer would hold that the most reliable testing of 
individuals must be done individually. But instead of one scale of 
somewhat indeterminate and variable content, he would prefer to 
see developed a series of shorter and more distinctive scales or sub- 
tests. Each sub-test should be shown, by factorial studies, to measure 
as homogeneous an ability as possible, if necessary over a limited age 
range only. Such abilities need not be pure factors. Indeed over- 
lapping is unavoidable, as they would all involve g, and possibly 
other factors, in common. But they should cover as wide a range of 
mental functions as possible, and so enable us to carry out more 
penetrating and more scientific differential diagnoses. Instead of 
merely summing the scores to provide a measure of general intelli- 
gence, we should combine the results of several scales with appro- 
priate weights to measure, say, intelligence in everyday life situations; 
a different combination would give aptitude for academic education, 
another for technical education, yet another for mental deterioration 
in different types of psychotic and brain-injured patients, and so on. 
Such weightings would be based on actual follow-up research, not 
on subjective judgments of what the scales measured. The Wechsler 
tests already go some way in this direction, but their sub-tests tend 
to overlap too highly, and they are mostly too unreliable to yield 
useful differential diagnoses. They give a good verbal I.Q. and a 
total (verbal + performance) I.Q.; but attempts to use patterns of 
scores, or weighted scores, on the different sub-tests for revealing 
various types of mental abnormality, seem to have broken down. A 
number of batteries of so-called differential aptitude tests have been 
published in America, descriptions of which arc given by Anastasi 
(1954). Though some of these give promise of being more useful in 
vocational guidance than a single general test, they too suffer from 
insufficient reliability and too great overlapping. Thus the develop- 
ment of a really satisfactory set of tests, beyond the stage that 
Wechsler has reached, is by no means easy. 
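
The idea of combining sub-test results with different weights for different purposes can be illustrated by a minimal sketch. The sub-test names, the standard scores and the weights below are all hypothetical; in practice, as the text insists, the weights would have to come from follow-up research.

```python
def weighted_composite(scores, weights):
    """Weighted sum of sub-test scores.

    scores  : dict mapping sub-test name -> standard score
    weights : dict mapping sub-test name -> weight for one purpose
    Only the sub-tests named in `weights` contribute to the composite.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError("no score for sub-tests: %s" % sorted(missing))
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical standard scores (mean 0, S.D. 1) of one testee.
scores = {"vocabulary": 1.0, "reasoning": 0.5, "spatial": -0.5, "mechanical": 0.0}

# Hypothetical weightings for two different predictions.
academic_weights = {"vocabulary": 0.5, "reasoning": 0.5}
technical_weights = {"reasoning": 0.25, "spatial": 0.5, "mechanical": 0.25}

academic = weighted_composite(scores, academic_weights)    # 0.75
technical = weighted_composite(scores, technical_weights)  # -0.125
```

The same set of sub-test scores thus yields a different composite for each purpose, which is the point of replacing a single summed total by several weighted combinations.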


LISTS OF TESTS AND METHODS OF MEASUREMENT 
IN EDUCATIONAL PSYCHOLOGY 

The following list includes most of the mental tests available in 
Britain at the time of writing, together with some particulars of age 
ranges, time requirements, publishers, prices, etc. A number of older 
tests, little used nowadays or out of print, are excluded; so also are 
most tests whose sales are restricted to Directors of Education or 
other specially qualified persons. A few tests published in Australia 
or in U.S.A. are included which seem likely to be useful to British 
psychologists and teachers. These and other tests from such countries 
can generally be ordered through the National Foundation for 
Educational Research in England and Wales (referred to in the list 
as NFER). 

Publishers. — Publishers’ names are abbreviated as follows: 

AE    Australian Council for Educational Research 
B     Baird Scientific Instrument Co., Edinburgh 
CL    Crosby Lockwood 
G     Gibson, Glasgow 
H     Harrap 
ICP   Institute of Child Psychology 
IPAT  Institute for Personality and Ability Testing¹ 
L     H. K. Lewis, London 
LG    Longmans, Green 
M     Methuen 
NFER  National Foundation for Educational Research 
NI    National Institute of Industrial Psychology 
N     Nelson 
OB    Oliver and Boyd, Edinburgh 
PC    Psychological Corporation, New York 
SRA   Science Research Associates, Chicago 
U     University of London Press 

Prices. — The prices quoted are those current in 1955 for 100 copies 
(including purchase tax), unless otherwise indicated. Naturally these 
are liable to fluctuation. Group tests are often available in dozens or 
sets of 25, at slightly increased relative cost. Single specimen copies 
with manuals are usually issued at about 2/- per test. An asterisk (*) 
in the prices column indicates that the answers can be written on 
blank paper or special answer sheets, or given orally, so that the 
printed blanks can be used over again. Where no price is quoted, 
the necessary material is given in the book referred to. 

Times. — When a definite figure for time limit is given, this usually 
means the actual working time, apart from instructions, etc. An 
approximate figure, e.g. 20-30, or c. 25, shows the number of minutes 
including instructions, or indicates that the time varies with different 
testees. 

Age Ranges. — These are given in two columns headed C.A. and 
M.A. The former indicates that percentile, standard score or 
other norms are available for all children in this range, whereas the 
latter implies that mean (or median) test scores only are listed for 
certain ages. Thus a test with C.A. 10-12 covers a wider range than 
one with M.A. 8-14½, since the latter will not discriminate below 
I.Q. 80 at 10.0 years, nor above I.Q. 120 at 12.0 years. 

¹ Agent, Mrs. N. M. Barnes, The Grammar School, Henley-on-Thames. 

References to books or articles where descriptions of, or norms 
for, the test may be found are generally shown, by date, after the 
author’s name. Some additional references and notes are given in 
small type. Readers who require a detailed critical evaluation of any 
American published test should consult the latest edition of Buros’s 
(1953) Mental Measurements Yearbook. This, however, covers only 
a proportion of British and Australian tests. 



ATTAINMENT, ACHIEVEMENT, OR EDUCATIONAL TESTS 
Examinations, yielding school or college marks (cf. Chaps. XI-XIV). 
(v) Miscellaneous English Tests 

Cattell (1936), Midland Attainment Tests in — 6½-14 
Alexander, Thanet Test .... — 10½-12½ 


Coloured Progressive Matrices, Sets A, 5-11 — c. 30 L, H 8/8/0 per 25* 



Porteus (1952) Maze Test — 3-17 5-15 H 6/- book 
Each of 12 mazes ..... — — — — 6 - 
Seguin-Goddard Formboard; cf. Cattell (1936), — 3½-13 — 2-6 B 3T4/6set 
Vernon (1937) 
Watts, Group Performance Test (Formboards) 10-12 — c. 45 NFER — 


Jenkins, and Lee and Jenkins, Non-Verbal Scale 10-11 — * 30 NFER 3/6/0 
of Mental Ability, Forms 1, 2. (Restricted 
sales) 

Lewis, Non-Verbal Test of Mental Ability (for 8-15 — 30 LG 1/3 each 

ilinson, Northern Test of Educability . — 8-17 50 U 217 H* 
Answer sheets ..... — — — — 15/1 
CHAPTER X 


HINTS TO TESTERS 

1. Choice of a Test. — In choosing a test for some particular purpose, 
the tester’s first consideration should, of course, be its validity for 
that purpose. As we have seen in Chap. IX, he may be able to judge 
its validity to some extent by inspection of its content, but he would 
be better advised to seek out experimental proof of what it is that 
the test measures. Investigations by its author, or by others who have 
used the test, should be consulted. The mere fact that it is called a 
test of such-and-such an ability does not constitute evidence. The 
second consideration (which greatly affects the first) is the test’s 
reliability. Generally speaking, a test must be fairly long, and its 
scoring must be objective, if it is to be satisfactory in this respect. 
Most authors of good tests publish evidence of their reliability in the 
accompanying manuals. Little trust should be placed in tests whose 
scores are not known to yield a reliability coefficient of 0.90 or over 
in a representative group — say, a complete year-group of children. 
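
The link between length and reliability can be made concrete with the Spearman-Brown prophecy formula. The formula is a standard psychometric result and is not quoted in this section; it assumes that the added items are of the same kind and quality as the existing ones. The starting reliability of 0.75 is a hypothetical figure chosen for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times as long."""
    k = length_factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the `target` reliability."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


# A hypothetical short test with reliability 0.75:
doubled = spearman_brown(0.75, 2)           # about 0.857
factor = length_factor_needed(0.75, 0.90)   # about 3.0
```

On these assumptions a test with a reliability of 0.75 would need to be made roughly three times as long before it met the 0.90 standard suggested above.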

2. Age Range. — Few, if any, tests presume to cover all ranges of 
ability, and most educational and intelligence tests are only intended 
to cover quite a limited range. Thus the tester should be careful to 
choose from among the available tests those which will, when applied 
to a large group of his testees, be likely to yield a normal distribution 
of scores, and which will not unduly curtail this distribution at either 
end. For instance, in testing an ordinary school class, he should first 
estimate roughly the range of M.A.s or E.A.s which it contains. A 
class of average C.A. 10 years may often include individuals with 
M.A.s (on a group test) of 6 to 14 years, though the great majority 
should lie between 7½ and 12½ years. He should then study the ages 
for which the various tests provide norms, and pick one which covers 
as much of this predicted range as possible. If he has reason to believe 
that the average M.A. or E.A. of the class will be, perhaps, a year 
above or below its average C.A., he must correspondingly adjust his 
choice of test. 

The ‘differentiating power’ of the test norms should also be noted. 
They should show a fairly large and even increase in score from one 
age level to the next. Suppose that a test containing, say, 200 items 
has the following norms from 6 to 13 years. The increases are large 
up to 9 or 10, but then tail off, till there is a difference of only 5 items 
between the 12- and 13-year scores. Such a test will clearly not 
discriminate or differentiate well at the upper end, and its effective range 
of norms is from 6 to 11 rather than from 6 to 13. 

Years    6    7    8    9   10   11   12   13 
Score   40   71  101  128  151  168  179  184 

Similar considerations apply to the choice of tests whose norms 
are issued in percentile, or other, forms. 
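
The check on differentiating power described above amounts to inspecting the gain in score from each age level to the next. A minimal sketch using the norms quoted in the example follows; the cut-off of 15 items per year is an assumption made here for illustration and is not a figure given in the text.

```python
# Age norms from the worked example: age in years -> average score.
norms = {6: 40, 7: 71, 8: 101, 9: 128, 10: 151, 11: 168, 12: 179, 13: 184}


def yearly_gains(norms):
    """Increase in the norm from each age level to the next."""
    ages = sorted(norms)
    return {(a, b): norms[b] - norms[a] for a, b in zip(ages, ages[1:])}


def effective_upper_age(norms, min_gain=15):
    """Highest age reached before the yearly gain first drops below `min_gain`.

    `min_gain` is a hypothetical threshold, not one laid down in the text.
    """
    ages = sorted(norms)
    upper = ages[0]
    for a, b in zip(ages, ages[1:]):
        if norms[b] - norms[a] < min_gain:
            break
        upper = b
    return upper


gains = yearly_gains(norms)         # 31, 30, 27, 23, 17, 11, 5
upper = effective_upper_age(norms)  # 11
```

With this threshold the effective range comes out as 6 to 11 years, in agreement with the conclusion drawn from the table.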

3. Convenience of Arrangement and Scoring. — If a test is to be 
used extensively, attention should be paid to its general arrangement 
from the point of view of ease in applying and scoring. With some 
performance tests and tests of special aptitudes, much time is wasted 
in setting out the material, or the instructions are very elaborate, or 
the scoring unnecessarily detailed. More modern American tests 
usually try to reduce these inconveniences. 

Group tests of attainment or intelligence may also vary greatly in 
‘mechanics’ or ‘lay-out.’ Some of the older ones have insufficient 
instructions and sample items or practice sheets, so that testees often 
fail to understand what they have to do, and a lot of additional oral 
explanation is needed. Sometimes the testees have to answer in their 
own words, and the scoring manual never allows for all the possible 
answers which turn up, so that much time is wasted in deciding 
whether or not an answer shall be counted. The positions of the 
answers are often scattered all over the page, instead of being 
arranged in the right-hand margin. True, stencils are often provided 
which, when laid over the test page, show up the correct answers. 
But, according to the present writer’s experience, it is far more 
trouble to fit the stencils over each successive page than it is to 
compare the testee’s booklet with a booklet on which all the correct 
answers are clearly marked. 

Many American group tests are now issued with an interleaved 
carbon-paper device, so arranged that all the correct answers, and 
none of the incorrect ones, are automatically shown as crosses on 
the back of the page. The scorer then merely has to count the 
number of crosses. Alternatively, testees may check their answers with a 
special carbon pencil; their blanks are then passed through an 
electrical machine which counts instantaneously the number of 
carbon marks in correct positions, through the amount of current 
which passes. In this country, however, testing is seldom carried out 
on so large a scale as to warrant such ingenuity, and its consequent 
expense. 

4. Procedure. — Many of the following hints may seem puerile to 
some readers, yet they are all based on the writer’s experience of 
mistakes actually committed by amateur testers. Such mistakes may 
often lead to quite serious consequences. 

The tester should first study the instructions and the test material 
carefully with a view to seeing precisely how the test works, and 
should then follow these instructions to the letter. Often it is worth 
the tester’s while to take the test himself. In Binet testing, the results 
should not be trusted until such time as the tester practically knows 
the instructions, and the essentials of scoring, by heart, and until he 
has given the test to, say, twenty cases. 

5. Modification of Instructions or Materials. — The inexperienced 
tester is very apt to find flaws in the test or instructions which, he 
thinks, could be eliminated by some simple modifications. Very 
likely the test could be improved, but then it would no longer be the 
same test, and the published norms would no longer hold good. For 
the same reason a printed group test must not be duplicated or 
mimeographed in order to save expense. Quite apart from the 
probable infringement of copyright, the difficulty of the test is 
likely to be altered thereby. 

6. Timing. — Accurate timing is essential. In some group tests a 
mistake of 5 seconds may quite possibly lead to an error of 1% or 
2% in average I.Q. Testers who do not own stop-watches are advised 
to use omnibus tests when feasible, since these require only a single 
timing. If times are kept by a watch with a seconds hand, the exact 
minute and second at which a test is to stop should be written down 
as soon as the test has started. Mistakes very easily occur if the 
tester trusts to his memory. 
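
The rule of writing down the stopping time as soon as the test starts is simple clock arithmetic. A minimal sketch follows; the starting time and the 25-minute limit are hypothetical values, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def stop_time(start, minutes, seconds=0):
    """Exact clock time at which a test begun at `start` must be stopped."""
    return start + timedelta(minutes=minutes, seconds=seconds)


# Hypothetical example: the test starts at 10:14:37 and has a 25-minute limit.
start = datetime(1955, 1, 1, 10, 14, 37)
stop = stop_time(start, 25)
stop_written_down = stop.strftime("%H:%M:%S")  # "10:39:37"
```

Working out the figure once, at the start, removes the need to carry the sum in memory while supervising the class.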

7. Underlining and Erasing. — Before giving a group test to a class, 
all rulers and indiarubbers should be firmly removed, otherwise the 
pupils will waste several minutes in ruling their underlinings, or 
rubbing out responses when they change their minds. If the test 
instructions do not already say so, the tester should explain to the 
pupils that, in order to change a response, they should scribble out 
the first choice rapidly, and then mark or underline the new choice 
clearly. 
8. Distractions and Copying. — In all testing, possible sources of 
distraction should be avoided. For instance, a group test should not 
be given when a singing lesson is due to occur in the next classroom. 
The testees should be spaced as far apart as possible, since it is very 
easy, in multiple-choice tests, for one to overlook and copy the 
answers of another. Often the younger children have no intention of 
cheating, but are yet extremely anxious to write down the same as 
their neighbours, simply because the whole situation is unfamiliar to 
them, and they are afraid of doing the ‘wrong thing.’ 

9. Practice Sheets and Samples. — In most group tests for children 
there are preliminary practice sheets or sample items, where help 
may be given. The tester should make sure that even the dullest 
testees understand what to do, and that they can manage at least one 
or two of the sample items by themselves. 

10. Coaching. — No practice or coaching of any kind, beyond 
what is specified, should be given. Testees who gain an advantage by 
such means will no longer be comparable with those on whom the 
test was standardized, and the norms will be useless (cf. pp. 165-7). 

11. The ‘Subjective’ Situation. — At the same time as adhering to 
the standard objective conditions of testing, the tester should try to 
obtain, even in a group test, a favourable subjective atmosphere. If 
the tester himself is apprehensive, or bored, his mood is likely to 
communicate itself to a group of children. The testees should not be 
excited and anxious, nor, of course, lackadaisical, in their attitude 
to the test. Teachers should not peer over children’s shoulders while 
they are working to see how they are getting on; it inevitably upsets 
them. Still less should they criticise, or help, children with wrong 
answers. 

12. Testing and Teaching. — Many teachers make bad testers 
because the teaching attitude is so ingrained in them that they cannot 
refrain from drawing morals, or pointing out mistakes. A test is not 
a kind of lesson, and children should not be expected to learn 
anything whatsoever from it. Rather it is a matter of stocktaking. The 
essential object is to obtain a record of the testees’ performances 
under certain standard conditions, so that these performances may 
be compared with those of other similar testees who have been 
tested under the same conditions. 

13. Scoring. — The scoring instructions must be followed as 
meticulously as the instructions for applying the test. The tester may 
sometimes disagree with the response provided in the scoring key, 
and think that some alternative is just as good or better. This may be 
so, but once personal opinion enters, the scoring ceases to be 
objective, and the results again cannot be compared with the test 
norms. The scoring of most group tests is quite mechanical, but 
errors very readily creep in, both in checking the right responses and 
in adding up correct checks. It is extremely desirable for such tests 
to be scored twice, preferably by different persons. 

14. The Subjective Situation in Individual Testing. — The conditions 
of individual testing require still more tact and sympathy than those 
of group testing. By means of preliminary conversation the child 
should be put at ease, and interspersed irrelevancies should keep him 
interested and responsive throughout. If he shows signs of fatigue, 
distraction, or resistance the testing should be broken off. The good 
tester develops a kind of ‘bedside manner,’ and avoids formalities. 
He does not read the oral tests from the handbook, but, knowing 
them by heart, says them in a conversational tone. In order to keep 
the child interested he must himself sound interesting. 
Encouragement should follow every response, but some caution is needed in 
praising incorrect responses, since many children may realize when 
they are wrong and become careless or discouraged if there is no 
criticism nor stimulus to do better. The presence of a third person, 
particularly the mother or head teacher, constitutes a distraction 
which should always be avoided. 

In giving the older versions of the Binet scale, it was the common 
practice among trained testers to jump about from one part of the 
scale to another. This enabled the tester to choose the test which 
seemed to him to be most appropriate to the child’s mood at the 
moment, also to save time by grouping together tests with identical 
instructions (e.g. the digit memory tests); and it prevented all the 
most difficult tests from coming at the end of the session. This 
procedure has been forbidden by Terman and Merrill (1937). But Hutt 
(1947) shows that it makes no difference to the I.Q.s of normal 
children and is, if anything, more fair to unstable children. The 
Terman-Merrill handbook gives several other useful hints on obtain- 
ing good rapport, while at the same time adhering strictly to the 
standard instructions and scoring. 

15. Record Keeping. — The following principle is commonly 
adopted by experienced testers: never spend any of the child’s time 
in making records of his responses. Notes should certainly be made, 
not only of test responses, but also of manner, speech, etc. All the 
recording, however, should be done while the child is engaged either 
in doing a test, in irrelevant conversation, or in listening to the 
instructions for another test; and it should be so unobtrusive that it 
never constitutes a distraction. Poor testers not only waste time in 
recording responses and looking up the scoring, but also allow 
awkward gaps to occur which entirely spoil the favourable atmo- 
sphere of alertness and interest. The reason for this is usually mere 
lack of familiarity with instructions and scoring. 

16. Impartiality. — While the good tester enters very fully into the 
thoughts and feelings of the child he is testing, yet he also needs to 
exert considerable self-control. There are various ways in which he 
may (quite unwittingly) give illegitimate hints, for instance, by 
over-emphasizing the crucial words in an oral test, or by making empathic 
movements in the direction of the correct blocks, holes, etc., of a 
performance test. Further, he cannot help but feel more sympathetic 
with some children than with others. He should, therefore, 
frequently ask himself: “Am I in any way being prejudiced by my 
liking for this child; am I giving hints or crediting him with doubtful 
passes because I believe that he is bright, which I would not do if I 
believed that he was dull?” 

17. Studying the Whole Child. — The tester should never regard 
his job merely as measurement of the I.Q. or other test scores 
(although many doctors, teachers, and others may seem to regard 
that as his sole function). Rather he should try to understand, and 
to build up a picture of, each child’s whole personality. By doing so 
his test results are more likely to be accurate, since he will realize 
when they are being distorted by emotional or other factors; and he 
will certainly be better able to interpret their significance when the 
guidance or treatment of the child is under consideration. Although, 
as we have seen, he is not justified in analysing a heterogeneous test 
like the Binet scale into tests of hypothetical discrete abilities 
(memory, verbal, practical, etc.), yet he can pick up innumerable 
hints regarding the child’s intellectual and emotional characteristics, 
which can later be explored more thoroughly. Suggestions of specific 
educational difficulties may readily be followed up. Buehler (1938) 
shows that responses to the Ball and Field (or Purse and Field) test 
may give valuable diagnostic indications. Other Binet items and, still 
more, performance tests like the Porteus Mazes may be equally 
revealing (cf. Vernon, 1937; Porteus, 1952). These deductions should, 
of course, be checked later by comparing them with other sources of 
information, such as the psychiatrist’s case-record, when available. 
In this way a tester can always be extending his methods and 
improving their validity. 

18. Interpretation of Test Results: Unreliability. — No mental test 
score should ever be accepted at its face value, nor trusted in the same 
way as physical measurements are trusted. Even the best tests, it 
should be remembered, only measure to within a certain Probable 
Error. In other words, a test score should be regarded as the centre 
of a range of possible scores. It may be correct to within, say, 5%, 
but also it may be in error by as much as 20%, unless it has been 
confirmed by the results of other tests which narrow the range of 
uncertainty. 
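
The notion of a score as the centre of a range can be illustrated with the usual formulae for the standard error of measurement and the Probable Error. These formulae are standard results not quoted in this paragraph, and the Standard Deviation of 15 and the reliability of 0.90 are assumed values chosen for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """S.E. of a single obtained score: S.D. * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def probable_error(sd, reliability):
    """Probable Error: 0.6745 times the standard error of measurement.

    Half of all obtained scores are expected to lie within one P.E.
    of the true score, assuming normally distributed errors.
    """
    return 0.6745 * standard_error_of_measurement(sd, reliability)


def even_chance_band(score, sd, reliability):
    """Range within which the true score lies with roughly even odds."""
    pe = probable_error(sd, reliability)
    return (score - pe, score + pe)


# Assumed values: quotient scale with S.D. 15, test reliability 0.90.
pe = probable_error(15, 0.90)             # about 3.2 points
band = even_chance_band(100, 15, 0.90)    # about (96.8, 103.2)
```

Even on these favourable assumptions the chances are only even that a quotient of 100 lies within about 3 points of the true figure, and larger errors remain quite possible.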

19. Units of Measurement. — The imperfections of mental units 
should also be borne in mind, for example, the variations in size of 
M.A. and E.A. units (cf. pp. 70-1). The I.Q. is still more difficult for 
teachers and laymen to interpret correctly. When derived from 
M.A./C.A. it is primarily a measure of rate of mental growth. But 
when it represents a standard score (as in Moray House, Wechsler, 
and other tests), it is — like a percentile — a measure of brightness 
relative to other children of the same chronological age. In neither 
case is it an index of the amount of intelligence the child possesses 
at that moment; similarly the E.Q. is not a measure of present 
attainment. The M.A. or E.A., on the other hand, do provide such 
indices. The point may be brought out by the following example of 
the test scores of two children, A and B. 

A:   C.A. 9:0   M.A. 11:0   I.Q. 122 
B:   C.A. 8:0   M.A. 10:6   I.Q. 131 

Here it is A, not B, who is the more intelligent, since he can do more 
difficult intellectual tasks, and can be expected to be more advanced 
in school work. It is true, however, that B is the brighter relative to 
8-year-olds than A is relative to 9-year-olds; and as B is growing 
more rapidly in mental powers, it is likely that he will catch up with 
A in another four years, and (if the test is reliable and valid) that he 
will, as an adult, be the more intelligent of the two. 
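
The quotients in this example follow from the ratio formula I.Q. = 100 × M.A./C.A. A minimal sketch reproducing them is given below; the helper names are introduced here for illustration, and ages are written as years:months as in the example.

```python
def age_in_years(age):
    """Convert an age written as 'years:months' (e.g. '10:6') to years."""
    years, months = age.split(":")
    return int(years) + int(months) / 12


def ratio_iq(mental_age, chronological_age):
    """Ratio I.Q., rounded to the nearest whole number."""
    return round(100 * age_in_years(mental_age) / age_in_years(chronological_age))


children = {
    "A": {"ca": "9:0", "ma": "11:0"},
    "B": {"ca": "8:0", "ma": "10:6"},
}

iqs = {name: ratio_iq(c["ma"], c["ca"]) for name, c in children.items()}

# The child with the higher Mental Age and the child with the higher I.Q. differ.
highest_ma = max(children, key=lambda n: age_in_years(children[n]["ma"]))
highest_iq = max(iqs, key=iqs.get)
```

The two orderings disagree: A has the higher M.A. (the present level), B the higher I.Q. (the rate of growth), which is exactly the distinction the example is meant to bring out.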

In actual practical work of educational guidance or emotional 
treatment, the tester need hardly bother to calculate the I.Q. or 
E.Q., since what matters is the present level of the child’s 
intelligence and educational capacities. The only proper use of the I.Q. is 
for predicting a testee’s future career, i.e. for assessing approximately 
his ultimate educational or vocational level. 

20. Dependence of I.Q. on the Test Employed. — The naive tester or 
layman usually assumes that a child would obtain the same I.Q. 
whatever the test employed. This is so far from true that the test 
employed should invariably be quoted along with an M.A., E.A., 
I.Q., or E.Q. The reasons for these discrepancies have already been 
given, and will merely be summarized here. 

(a) Chance errors of measurement, due to the fact that no test is 
perfectly reliable (cf. pp. 125-9). 

(b) Many tests are not well standardized. For instance, American 
group tests may give results which are too lenient (pp. 66-8). 

(c) Different intelligence tests have different dispersions of M.A.s 
and I.Q.s, so that one test makes a bright child out to be much 
brighter, a dull child to be much duller, than does another test 
(pp. 71-3). 

(d) The testees may have had more practice or coaching relevant 
to one type of test than to the other (pp. 165-7). 

(e) All tests are liable to distortion by unsuspected group factors, 
or by emotional or other influences, about which we possess little 
exact knowledge, and which we are not able to control (pp. 163-5). 
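
The effect of differing dispersions, reason (c) above, can be shown with a short calculation. The sketch assumes quotients expressed as standard scores with a mean of 100; the two Standard Deviations, 15 and 20, are hypothetical values for two imaginary tests.

```python
def standard_score_iq(z, sd, mean=100):
    """Quotient for a child standing z standard deviations from the mean."""
    return mean + sd * z


# The same relative standing on two tests with different dispersions.
bright_on_test_1 = standard_score_iq(+2, sd=15)   # 130
bright_on_test_2 = standard_score_iq(+2, sd=20)   # 140
dull_on_test_1 = standard_score_iq(-2, sd=15)     # 70
dull_on_test_2 = standard_score_iq(-2, sd=20)     # 60
```

A child in the same position relative to his fellows thus appears 10 points brighter, or 10 points duller, on the more widely dispersed test, which is why the test employed should always be named alongside the quotient.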

In view of this uncertainty over the norms and the dispersion of 
group intelligence tests, the inexperienced tester would be well 
advised never to calculate individual I.Q.s from them, but to treat their 
results in purely relative fashion, i.e. in the way which was 
recommended for educational tests (cf. pp. 68-9).¹ The class teacher can 
obtain from a group test practically all the information which it is 
capable of yielding if he or she simply finds which child in the class 
has the highest score (corresponding to the highest M.A.), which the 
next highest, and so on. When test results are thus confined to rank 
orders, they are far less likely to be misinterpreted than is common 
at present. 
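
Reducing a set of group-test scores to a rank order is easily done. In the sketch below the pupils' names and raw scores are hypothetical, and tied pupils are given the same rank, a convention adopted here that the text does not prescribe.

```python
def rank_order(scores):
    """Return (rank, pupil, score) tuples, highest score first.

    Pupils with equal scores share a rank, and the next rank is
    skipped accordingly (1, 1, 3, ...).
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    ranked = []
    previous_score = None
    previous_rank = 0
    for position, (pupil, score) in enumerate(ordered, start=1):
        rank = previous_rank if score == previous_score else position
        ranked.append((rank, pupil, score))
        previous_score, previous_rank = score, rank
    return ranked


# Hypothetical raw scores for four pupils on one group test.
class_scores = {"Ann": 52, "Bob": 47, "Cath": 52, "Dan": 39}
ranking = rank_order(class_scores)
```

The resulting order says nothing about how far apart the pupils are on any absolute scale, and so makes no claim that depends on the norms or the dispersion of the particular test.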

21. Conclusion. — Mental tests have been widely, and often 
unfairly, criticized. But the progress of testing is probably hindered to 
a greater extent by its friends, who are ignorant of its limitations, 
than by its enemies. One of the main objects of this book is to 
analyse these limitations and to explain the underlying principles 
which must be understood if tests are to be applied and interpreted 
properly. No one ought to use tests who is not willing to familiarize 
himself with these matters, and with the literature which bears on the 
particular tests that he employs. In skilled hands, testing provides a 
far surer and more accurate tool for the assessment of abilities than 
do subjective impressions or the ordinary examination paper. But 
human nature is far too complex to be measured in the simple and 
direct ways in which physical quantities can be measured, and the 
over-enthusiastic but unskilled tester is only too likely to make 
serious mistakes. 

¹ An exception may be made in the case of 11+ selection tests, which almost 
invariably adopt a Standard Deviation of 15. Such a use of quotients provides 
the simplest way of making age allowances, without which far more of the older 
children in a year-group of candidates than of the younger ones would gain 
places. At the same time, egregious statistical errors, leading to gross unfairness, 
can be and often are committed when Education Authorities try to run such 
examinations without the advice of a statistically trained psychologist. 



CHAPTER XI 


EXAMINATIONS 

Functions of Examinations. — The chief functions of ordinary written 
examinations may be summarized as follows: 

1. They are employed as tests of achievement, to prove that an 
individual pupil has, or has not, acquired a certain amount of 
knowledge of a subject. 

2. They are given to a whole class, or school, for a similar purpose, 
or for assessing the efficiency of the teachers whose pupils are 
examined. Although ‘payment by results’ has disappeared, schools 
are still commonly judged by the public on the basis of their 
examination successes. And both head teachers and inspectors try to 
ensure the maintenance of certain standards by means of periodic 
oral or written examinations. 

3. Many of the most important examinations are assumed to 
possess a prognostic function, i.e. to be capable of predicting future 
achievements. Thus they are used to bar the gates between the 
primary and the secondary school, between the secondary school 
and the university, between the university and many professions; 
and they are supposed to select the few who are best fitted for these 
higher stages. 

4. Many qualities besides achievement, or promise of 
achievement, in particular scholastic subjects, are believed to manifest 
themselves in examination success. The business employers who insist 
that their clerks shall have passed Matriculation or the General 
Certificate do not actually desire employees who are good at Latin, 
biology, and the like. When asked why they attach such importance 
to academic examinations, their answers show that they hope, rather, 
to acquire clerks of good general all-round ability in non-academic 
as well as in academic subjects. They say also that success at an 
examination requires a certain amount of perseverance and 
industry, some docility to discipline, and calmness under pressure; 
and that these qualities of character are desirable also in their 
employees. The significance attached to a university degree is often 
almost as variegated. Employers are probably quite as much to 
blame for the stranglehold of the contemporary examination system 
as are the teaching profession and the education authorities. 

5. Another function of examinations whose existence is seldom 
explicitly admitted is that they constitute an incentive for stimulating 
pupils and students to work, particularly in the secondary school 
and university; also for keeping teachers up to the mark. Indeed, 
some would maintain that the prospect of having to do examinations 
must still be employed as one of the chief motives in education, even 
if their results are entirely worthless. Clearly, then, examinations are 
an essential feature of the whole educational system which relies 
largely upon extraneous motives, such as competition, fear of 
punishment, and blame. And they can hardly be discarded until the 
more progressive ideal of appealing to the pupils’ or students’ 
spontaneous interests becomes a practical reality. 

6. Many American writers (e.g. Kandel, 1936) believe that the 
only legitimate function of examinations should be that of guidance. 
By checking up the results of his instruction the teacher can see 
where he has failed to make some topic clear, and can improve his 
methods. By examining a pupil’s accomplishments he can both 
diagnose the pupil’s weak points which require special attention, and 
can direct his endeavours along the lines for which he is most suited. 
Secondary schools and universities should apply examinations to 
entering candidates, not so as to keep out all but a select few, but to 
discover the fields of work in which each candidate is likely to do 
best, and should dissuade from entering only those who could 
profit from no kind of advanced study. 

Unfortunately, the traditional written examination fails very 
badly in its attempts to fulfil these many different functions. We will 
consider first the criticisms which have been levelled against it, and 
later ask whether other measuring instruments would serve the 
purpose better, or whether its adequacy can in any way be improved. 

Criticism of the Effects of Examinations upon Education. The 
first and commonest complaint is that examinations dominate and 
distort the whole curriculum. Although only about 5% of the 
population reach the university, every grammar school pupil has to 
study for the General (or the Scottish Leaving) Certificate as though 
his ultimate objective was a university degree. The schools cannot 
afford to educate in the broader sense, nor to give the majority of 
pupils such knowledge as would be of most use to them in after 
life, because they must work towards examinations concerned 
almost exclusively with formal, academic subjects. Again, narrow 
specialization is apt to occur in grammar and public schools, for the 
sake of university scholarships. 

Even more widespread are the bad effects of the 11+ secondary 
school entrance examination. Subjects or activities which do not 
contribute directly to passing the tests are generally neglected in the 
last year or two of primary schooling. In the larger schools, children 
are streamed even at the infant stage into classes thought likely to 
pass or fail. Homework and coaching out of school hours are introduced 
at too early an age (more often, be it noted, at the behest of 
the parents than of the teachers). 

Furthermore, these examinations stimulate an unhealthy, competitive 
spirit in children, and at all ages they encourage the cramming 
of set books and rote memorization. Doubtless they act as an 
incentive (function No. 5) but not an incentive to learning for its own 
sake. Indeed, examinations are an inefficient type of incentive, since 
most pupils and students regard them as too remote to bother about 
until a few weeks or days beforehand. A few candidates, however, 
take them much too seriously, or are forced by their parents and 
teachers to over-work; with the result that many of the maladjusted 
children who come to Psychological Clinics for advice and treatment 
show signs of this examination strain. 

An additional excuse is often advanced for this state of affairs, one 
which is based on the old fallacy of formal discipline. It is said that 
although the passing of examinations in academic subjects is of no 
direct value to the vast majority of the population, yet the pupils’ 
minds are trained thereby so that they become better able to learn 
other subjects, to reason, to write good English, and so forth. A 
large body of experimental evidence not only proves that this spread 
of training in one field of study to other fields is usually extremely 
limited in extent, but also suggests that the very opposite may occur. 
The system may indeed train pupils and students in the best methods 
of passing examinations in other subjects, but it is more likely to 
stunt ‘reasoning power’ owing to its emphasis on mechanical 
memorization. Often it produces a dislike of everything associated 
with school, and so discourages the desire for further education. It 
does not help to develop writing of good English, since the English 
style in which most examination answers are written is known to be, 
on the average, markedly inferior to the style in which the same 
pupils can write compositions at leisure. 


INADEQUATE RELIABILITY OF EXAMINATIONS 

We will not, however, dwell on these points since, for the purposes 
of this book, it is a more serious matter that most examinations are 
very defective methods of measuring the achievements or the future 
promise which they set out to measure. Their chief, though not their 
only, flaw is inadequate reliability. The following facts are well 
established: 

(a) When examinees sit two different examinations dealing with 
the same scholastic subject, both set and marked by the same 
examiner, they are likely to obtain distinctly different marks on the 
two occasions. 

(b) When a single set of scripts is marked by two different 
examiners, the resulting marks are liable to show wide discrepancies. 

The main causes of poor reliability are: 

(1) Actual changes in the examinees' mental and physical states 
which affect their examination performances. 

(2) Inadequate sampling of the examinees' knowledge and ability 
by the particular questions set. 

(3) Inconsistencies in the standards of marking adopted by 
different examiners, or by the same examiner on different occasions. 

(4) Differences of opinion between different examiners regarding 
the relative merit of the examinees' answers. 

1. Changes in the Examinees. This was described above (p. 125) 
as 'function fluctuation.' Two closely parallel examinations sat on 
consecutive days will always give somewhat divergent results, because 
some examinees will be feeling more fit, or in a better frame of 
mind, at the first of the two, others at the second. Many external and 
irrelevant happenings may temporarily depress or improve the work 
of a few individuals. Over long intervals the examinees' abilities will 
naturally show considerable developments and alterations. On the 
whole, however, the uncertainties occasioned by such factors are 
probably smaller and less important than those due to the other 
three causes. 

2. Inadequate Sampling. This factor is responsible for the common 
complaint among examinees, namely, that the questions ‘did 
not suit them,’ that on another set of questions they might do better 
(or worse). A pupil’s capacities in the various special branches or 
aspects of a subject are always somewhat uneven. And no examination 
is capable of bringing out everything that the pupil can do in 
that subject, or of measuring all his strong and weak points. Hence, 
the half-dozen or so questions which he answers are never more than 
a sample of the questions which would be needed to cover the whole 
field. Common sense, as well as statistical principles, tells us that 
the more lengthy the examination, and the greater the number of 
questions, the more representative will the sample be, and the better 
will the ground be covered. Thus in certain university honours 
examinations where the students’ marks are based on twenty-four 
hours of examination work, their marks would be unlikely to alter 
to any great extent if they sat another similar series of papers. But 
when an examination consists, say, only of one short paper in 
English and one in arithmetic, quite large alterations might easily 
occur in further parallel papers. The correlation between results on 
ordinary 1- or 2-hour papers in the same subject depends considerably 
on the heterogeneity, or range of ability, among the 
examinees (cf. p. 122); but a fairly typical figure is +0.70, and it is 
often less than this. We can therefore predict, by applying the 
Spearman-Brown formula (p. 127), that examinees would require to 
take at least four such papers before the ground was sufficiently 
covered for the reliability coefficient to rise to the fairly respectable 
figure of +0.90. 
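
The prediction just made can be checked with a few lines of arithmetic. 
The sketch below assumes the usual form of the Spearman-Brown formula, 
r_n = n r / (1 + (n - 1) r), which we take to be the form intended on 
p. 127. The function names are our own and purely illustrative. 

```python
def spearman_brown(r_single: float, n_papers: float) -> float:
    """Predicted reliability of n parallel papers, each of reliability r_single."""
    return n_papers * r_single / (1 + (n_papers - 1) * r_single)


def papers_needed(r_single: float, r_target: float) -> float:
    """Lengthening factor required to raise reliability from r_single to r_target.

    Obtained by solving the Spearman-Brown formula for n.
    """
    return r_target * (1 - r_single) / (r_single * (1 - r_target))


# Reliability of 1, 2, 3 and 4 papers when a single paper gives 0.70.
for n in range(1, 5):
    print(n, round(spearman_brown(0.70, n), 3))

# Fractional number of papers needed to reach 0.90.
print(round(papers_needed(0.70, 0.90), 2))
```

On these assumptions three papers fall a little short of 0.90 and four 
just exceed it, which agrees with the statement that at least four 
papers are required. 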

Suppose that one hundred pupils take such an examination, and 
that the passing mark is fixed so that half of them pass. Now if they 
take a second similar paper which correlates +0.69 with the first, 
we can predict that only 37 of the pupils would pass on both, and 
37 fail on both, and that as many as 26 would pass the first and fail 
the second, or vice versa. This hypothetical but only too typical 
illustration demonstrates the unfairness which inevitably results 
from the lack of thoroughness of many examinations, and in particular 
the difficulties which arise when the examination is supposed 
to divide the candidates into two distinct categories at an arbitrary 
pass mark. 
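
Figures of this kind follow if the marks on the two papers are assumed 
to be normally distributed with the stated correlation, an assumption 
which the text does not spell out. With the pass mark at the median, 
the proportion passing both papers is then 1/4 + arcsin(r) / (2 pi). 
The following is a minimal sketch under that assumption; the names 
are illustrative only. 

```python
import math


def median_split_table(r: float, n_pupils: int = 100) -> dict:
    """Expected counts when two papers correlating r are both cut at the median.

    Assumes a bivariate normal distribution of marks.
    """
    both = n_pupils * (0.25 + math.asin(r) / (2 * math.pi))
    changed = n_pupils - 2 * both  # pass one paper, fail the other
    return {"pass_both": both, "fail_both": both, "changed": changed}


table = median_split_table(0.69)
print({name: round(value) for name, value in table.items()})
```

Because the pass mark is at the median, the numbers passing both and 
failing both are necessarily equal, and the three counts must add up 
to the hundred pupils. With r = +0.69 the calculation gives about 37, 
37 and 26. 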

An additional reason for the variability of a pupil’s level in 
answering different questions or sets of questions is that the majority 
of examination answers consist of English compositions or essays. 
(This does not usually apply to papers in mathematics or in foreign 
languages; but it does to English, history, geography, science, and 
other subjects.) And the English essay is known to be a particularly 
variable production. The compositions written by a pupil weekly 
throughout a school year fluctuate widely in their merits. As Ballard 
(1923) points out, teachers may refuse to accept this statement, and 
claim that their pupils are fairly consistent from week to week. But 
the fact that a teacher marks any one pupil’s work on a dead level 
probably indicates that, unwittingly, he has formed a stereotyped 
conception regarding his ability, and is marking the successive 
compositions not on their own merits but in the light of his know- 
ledge of the pupil’s average achievement. The truth of this suggestion 
is borne out by the many stories of the two pupils who regularly 
received, one a good, the other a bad, mark, and who continued to 
do so even when they exchanged their essays. 

An outside examiner does not possess this knowledge of the 
pupil’s average work, and so will give poor marks to those who 
happen to write answers much below their normal level, good marks 
to those who rise above it. Moreover, under the stress of examination 
conditions, such deviations are likely to be exaggerated. Probably 
much more representative products would be forthcoming if the 
examination could be answered at leisure, under conditions similar 
to those which apply when pupils do written work at home, or at 
school. The common practice in Civil Service, and certain other, 
examinations of using a single essay as an indication of the candi- 
dates’ all-round literacy, cultural background, and ‘quality of mind’ 
is particularly deplorable, as pointed out by Vernon and Millican 
(1954) in an investigation of students’ essays on several different 
topics. 

3. Divergent Scales of Marks. This factor has already been discussed 
fully in Chaps. III and IV, hence we will not pause to re-argue 
the illogicality and injustice of a system which permits a far higher 
proportion of passes, or a higher average mark, to be awarded in 
one university or General Certificate subject than in another. If a 
mark of 80%, or a first class or credit, signifies to one examiner the 
standard achieved by a quarter of the examinees, and to another the 
standard reached by only 1% of the examinees, then it is obvious 
that this mark means different things to different people. And it 
follows that an examinee’s mark will vary with the particular ex- 
aminer who happens to read his papers, unless all the marks given 
by all the examiners are uniformly standardized in accordance with 
the principles outlined above. 

The standards of marking even of a single examiner are liable to 
alter with fatigue and mood. When marking large numbers of answers 
to the same question, he may begin very strictly and gradually 
become more lenient, or vice versa. The same script might receive 
a different mark if read after, instead of before, dinner. 

4. Differences in Opinion as to Relative Merit. The most important 
source of unreliability (which is usually found in combination 
with No. 3) is the personal element which enters into every ex- 
aminer’s evaluation of a script. We will first summarize some of the 
evidence regarding this subjectivity. The classical experiments were 
carried out in America as far back as 1912. Starch and Elliott (1913) 
sent copies of a single geometry paper to the chief geometry teachers 
of 116 high schools, with the request that they should mark it in 
accordance with their usual practice. The marks awarded varied all 
the way from 28% to 92%. Two of the teachers gave it more than 
90%, 18 gave it between 80% and 90%, 18 gave it between 60% and 
30%, and 2 under 30%. Similar results were obtained with papers in 
English and in history. Equally striking is the anecdote related by 
Wood (1921) of the papers which were marked in the usual way by 
six examiners. The first examiner, for his own guidance, wrote out a 
set of (what he regarded as) model answers to the questions; but he 
unfortunately left this among the candidates’ papers. He was subse- 
quently awarded marks varying from 40% to 90% by the other five 
examiners! Many experiments, conducted under far more strictly 
controlled conditions than these, have substantially confirmed their 
findings. Particularly noteworthy are the elaborate studies carried 
out in England and France in the 1930s, under the auspices of the 
International Examinations Enquiry Committee. 

Hartog and Rhodes (1935) arranged for typical groups of scripts 
to be marked by experienced examiners under normal examination 
conditions. Special place, School Certificate, and university honours 
examinations were represented, and several different subjects were 
studied. Generalizing from a number of their results, it appears that 
when half a dozen examiners independently mark a set of scripts, the 
lowest and highest marks given to any one script differ on the average 
by about 10 to 12%. Among a few of the scripts the discrepancies 
may be quite small, but among others they may be far larger, often 
greater than 20%. The authors show that part of the divergence may 
be ascribed to the different standards of leniency adopted by different 
examiners (No. 3, above), but that even when this factor is elimin- 
ated, large variations remain. In several of the experiments the 
examiners drew up an agreed scheme of instructions for marking, 
listing both the general characteristics and the detailed points which 
were to be looked for in the scripts. In the marking of English compositions, 
where this was scarcely possible, the inconsistencies were 
much greater. 

Now Hartog and Rhodes doubtless desired to shock the public 
conscience with regard to examinations, and so did their best to 
emphasize the disagreements.¹ The present writer is more impressed 
by the smallness than by the largeness of the discrepancies, and 
would conclude from his study of the results that important public 
examinations, like the School Certificate and university honours, are 
marked much more consistently than is usually the case. Instead of 
laying their chief emphasis on the extreme discrepancies, the authors 
might have pointed out that the median disagreement between any two 
examiners is not more than 3% in the best-conducted examinations. 
Since this figure represents the Probable Error of an examinee’s mark, 
a few examinees will naturally show discrepancies four or five times 
as large. It represents an average correlation between different 
examiners of +0.80, when the standard deviation of their marks is 
10%. And decidedly lower figures have been obtained by other in- 
vestigators. The authors also omit to point out that the reliability of 
an examinee’s final total mark is generally very much higher than 
their results indicate, since it may be based on half a dozen or more 
papers, each marked by two or more examiners. Their results mostly 
refer to single three-hour, or two two-hour, papers. 
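
The connexion between a Probable Error of 3%, a standard deviation of 
10% and a correlation of +0.80 can be verified as follows. The sketch 
assumes the customary relations, namely that the standard error of a 
mark is the standard deviation multiplied by the square root of 
(1 - r), and that the Probable Error is 0.6745 times the standard 
error. Neither relation is stated in this passage, and the function 
names are our own. 

```python
import math

PE_FACTOR = 0.6745  # probable error as a fraction of the standard error (normal curve)


def probable_error(sd: float, r: float) -> float:
    """Probable Error of a single mark, given the SD of marks and inter-examiner r."""
    return PE_FACTOR * sd * math.sqrt(1 - r)


def implied_correlation(pe: float, sd: float) -> float:
    """Inter-examiner correlation implied by a given Probable Error and SD."""
    return 1 - (pe / (PE_FACTOR * sd)) ** 2


print(round(probable_error(10, 0.80), 2))
print(round(implied_correlation(3, 10), 2))
```

On these assumptions the two statements agree closely: a correlation 
of +0.80 with a standard deviation of 10% yields a Probable Error of 
very nearly 3%, and conversely. 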

Nevertheless, this study deserves to be taken very seriously, since 
it indicates that many less thorough examinations are deplorably 
unreliable; that in the absence of a scheme of instructions drawn up 
and applied by experienced examiners, much worse discrepancies 
may arise; that when the average and the dispersion of marks are 
not standardized, gross differences may appear in the proportions of 
Credits, Passes, Fails, etc., which are awarded; and that even a P.E. 
of 3% may frequently make all the difference between a Pass and a 
Fail, or a First and a Second Class. 

¹ As most reviewers of An Examination of Examinations pointed out, some of 
the experiments were rendered quite futile by the selection of unduly 
homogeneous sets of scripts for the examiners to mark. 

Another example will be quoted which better illustrates the subjectivity 
of marking under ordinary school conditions. Boyd (1924) 
collected large numbers of essays on ‘A Day at the Seaside’ from 
11-year-old Scottish school-children (i.e. children at the qualifying 
stage). From these he selected 26 as representing the whole range of 
merit, and had them marked independently by 271 teachers. The 
marks were not numerical, but consisted of seven grades,* ranging 
from Excellent to Unsatisfactory. On comparing the results of 
different markers, it was found that almost every essay received at 
least 6 of the 7 possible grades. The following instance is typical: 3 
teachers marked it Excellent; 31 VG+, 80 VG, 125 G+, 28 G, 
and 4 marked it Moderate (the lowest grade but one). Boyd shows 
that part of the discrepancy was due to differing standards. But even 
when the 20% most lenient markers and the 20% most severe were 
excluded, the average essay was still awarded at least 4 out of the 7 
grades. He also points out that the average grade given to an essay 
by a large number of markers is highly consistent, and that the 
various divergent individual grades are distributed around this 
average more or less in accordance with the normal distribution 
curve. Actually it has been shown by Wiseman (1949) that a reason- 
ably reliable mark for English composition can be obtained at the 
11+ selection stage (since the range of ability is very much wider 
than it is among highly selected grammar school pupils or university 
students), provided that four examiners’ marks are combined. It is 
desirable also that each candidate should write on two or more 
topics. The burden of marking is not so heavy as might appear, since 
only the essays of some 10 to of borderline candidates need to 
be considered; and rapid general impression marking at the average 
rate of one minute per script is adequate. 
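
Boyd's typical instance, quoted above, may be summarized numerically. 
The sketch below takes the six counts from the text and assumes that 
the seven grades ran Excellent, VG+, VG, G+, G, Moderate, 
Unsatisfactory. The scoring of these grades from 7 down to 1 is our 
own device for illustration and is not Boyd's. 

```python
# Assumed ordering of the seven grades, best first.
grades = ["Excellent", "VG+", "VG", "G+", "G", "Moderate", "Unsatisfactory"]
# Number of teachers awarding each grade to the one essay (from the text).
counts = [3, 31, 80, 125, 28, 4, 0]
# Illustrative numerical scores: Excellent = 7 ... Unsatisfactory = 1.
scores = list(range(len(grades), 0, -1))

total = sum(counts)
mean_score = sum(s * c for s, c in zip(scores, counts)) / total
variance = sum(c * (s - mean_score) ** 2 for s, c in zip(scores, counts)) / total
sd_score = variance ** 0.5
modal_share = max(counts) / total
grades_used = sum(1 for c in counts if c > 0)

print(total, grades_used)
print(round(mean_score, 2), round(sd_score, 2), round(modal_share, 2))
```

The counts add up to the 271 teachers, six of the seven grades are 
used, and fewer than half of the markers agree on the commonest grade. 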

Other experiments, including one of Hartog’s, indicate that when 
the same examiner re-marks a set of scripts after an interval, the 
discrepancies between his first and second opinions are likely to be 
almost as great as those between two independent examiners. Some 
of the studies by the French members of the International Examinations 
Enquiry are also worth quoting. A set of papers in science was 
marked by three trained scientists (one of them doing it twice over), 
and also by a student who had no scientific training. When the markings 
were compared, the average correlation coefficient between the 
scientists was +0.63, whilst the correlations between the student and 
the scientists averaged +0.51. The latter figure is lower than the 
former, but so little lower that it casts some suspicion on the value of 
an examiner’s knowledge of his subject. 

Particularly striking was a study of the philosophy examination in 
the Baccalaureat. One hundred scripts were marked by six examiners. 
Only 10 of the candidates were passed by all the examiners, only 9 
failed by all of them. The remaining 71 were passed by some, failed 
by others.² It is, of course, always the examinees of medium ability 
whose true standing is most uncertain. The few best or few worst are 
recognized as such by all examiners. But only too frequently an 
arbitrary dividing line — the pass mark — has to be erected in the 
middle of this region of maximum uncertainty. Undoubtedly, there- 
fore, a large number of candidates who just succeed, or just fail to 
gain a General Certificate pass or a scholarship might very easily 
obtain the opposite result if they were marked by a different examiner, 
or if they sat an alternative set of papers. In many instances their 
whole educational or vocational career may be blighted owing to 
the unreliability of their marks. 

Reasons for the Subjectivity of Marks. The prime reason for the 
discrepancies which we have described is that different markers conceive 
differently the desirable or undesirable characteristics of examination 
answers. In other words, they attempt to assess many diverse 
types of abilities. One examiner may give chief credit for evidence of 
work done, and for the comprehensiveness and accuracy of the facts 
contained in the answers. Another may look rather for signs of 
future promise, originality, and grasp of general principles. One may 
insist on clarity of expression of ideas, another may try to assess the 
profoundness of the ideas irrespective of their good or bad expres- 
sion. Some may search for emotional rather than strictly intellectual 
qualities, such as the examinee’s interest in the subject. Whenever a 
candidate states his attitudes or opinions, these are liable to coincide, 
or clash with, the attitudes or opinions of the examiner; and however 
desirous the examiner may be of maintaining impartiality, he is liable 
to bias if his pet theories are approved or attacked by the candidates. 
Often, if he would take the trouble to formulate explicitly his notion 
of the aim of the examination, he would find that he means by a 
good examinee the one who closely approximates to himself, and as 
a poor candidate the one who shows none of the intellectual or 
emotional qualities which he himself idealizes. Is it to be wondered, 
then, that different examiners differ in their opinions of a candidate? 

² This result suggests that the correlation between any pair of examiners 
averaged only +0.60. 

We must recognize that any one script may manifest an extreme 
complexity of features, good and bad, and so can be looked at from 
a multitude of different points of view. Its merit can never be assessed 
in the same way that a physical object is measured. An examiner 
ordinarily splits it up into its chief components, notes the facts which 
it contains or omits, its elements of originality, style of expression, 
the mechanics of its grammar or computations, and so on. He 
attaches a certain weight to each of these possible components (a 
weight which is likely to be highly individualistic, unless a detailed 
scheme of marking has been adopted), and finally re-synthesizes the 
candidate’s performance on each component into a single mark for 
the answer as a whole. Some examiners do not attempt a systematic 
analysis of this kind, and merely mark on the basis of a general 
impression. But as Ballard (1923) and Hamilton (1929) show, every 
general impression involves a more or less vague analysis. 

Even in such subjects as arithmetic, or the translation of foreign 
languages, the weightings assigned to the various analysable features 
of an answer, and the penalties given to various possible mistakes, 
are liable to differ among different markers. But the complexity is far 
greater in the case of English composition, or in those subjects such 
as history, science, etc., where the answers consist chiefly of com- 
positions. The essay-type of answer has been very widely, and justi- 
fiably, criticized. Examinees have to waste a large proportion of the 
available time in translating their historical or scientific knowledge 
into readable essay-form, and in the mere process of writing. This 
latter is particularly unfair, because some can write much faster than 
others without necessarily being better historians or scientists. A 
considerable proportion of the written words (for instance, the a's, 
the’s, and’s, etc.) add nothing to the evidence of the examinee’s good 
or poor ability. Investigation shows that the average examinee puts 
into writing less than one fact or idea per minute of examination 
time, since the process of expressing these facts or ideas takes so 
long. The examiner, then, has the still more difficult task of translating 
the product back again, of trying to penetrate through the 
verbiage to the signs of ability and knowledge beneath. Few exam- 
iners can remain entirely uninfluenced by the literary style or the 
handwriting of an answer (there is direct experimental proof of this), 
although they are presumably trying to mark the product for its 
historical or scientific merit, not qua essay. 


INADEQUATE VALIDITY OF EXAMINATIONS 

The poor reliability of examinations greatly reduces their value for 
any and every purpose. For if they cannot even measure examination 
success without serious errors, they naturally cannot validly predict 
anything else. Even, however, if perfect reliability were established, 
validity would still be diminished by the dependence of examination 
success on factors other than ability, knowledge, and industry. 

One such factor is, obviously, the efficiency of the examinee’s 
teachers as examination coaches, and their cleverness in guessing the 
kinds of questions which examiners will set. It is manifestly unfair 
that a pupil at one school, where the teachers are specially adept at 
instilling examination knowledge, should succeed when a pupil of 
equal ability fails because he attends another school, where the 
teachers are either less skilled, or more humane in the education 
which they provide. 

There is also the old complaint that some pupils are ‘good examinees,’ 
while others become unduly ‘nervous,’ and ‘fail to do themselves 
justice.’ One wonders whether it is not usually the weaker 
candidates who make this complaint, and whether they would do 
better if subjected to any other kind of test, including the tests of real 
life. However, we may admit that certain temperamental qualities, 
which we can as yet hardly specify, together with good physical 
health, may confer some advantage on certain candidates, which 
other candidates lack. 

Examinations, then, certainly fail to fulfil the first two functions 
with which we started, namely, the measurement of achievement, 
both because of poor reliability and because of the other considerations 
just mentioned. How imperfect they are we cannot determine. 
We can, however, study their validity as measures of future ability 
by directly correlating their marks with the results of the same pupils 
at later stages in their careers. Comprehensive investigations into 
this matter were carried out by Valentine and Emmett (1932), with 
almost uniformly disheartening results. By comparing the subsequent 
secondary school achievement of groups of pupils who had been 
awarded, or who had failed to obtain, scholarships at the time of the 
special place examination, it was found that many of the latter 
ultimately surpassed the former group. Similarly, at the university 
level, Valentine found that roughly two-fifths of all those awarded 
scholarships failed to justify their promise, since they eventually ob- 
tained only third-class honours or pass degrees. They were beaten 
by as many as one-third of the non-scholars, who achieved first- or 
second-class honours. Of those who won scholarships given by the 
universities, only about one-fifth failed to live up to expectations, of 
State scholars only one-tenth. But this meant that practically 
one-half of those who obtained other types of scholarships (from their 
schools or from local authorities) did not justify themselves (cf. 
Valentine, 1938). 

In a further study by the Scottish Council for Research in Education 
(1936), marks gained by secondary school pupils at the Leaving 
Certificate examination, together with teachers’ marks and the head 
masters’ estimates of the pupils’ general standing, were correlated 
with subsequent class marks and degree marks at the universities. It 
was found that on the whole those who got first-class honours had 
been rated higher by the school head masters than those who got 
second class, and that honours students had been rated higher than 
ordinary degree students; but that, as in Valentine’s research, there 
were a great many exceptions. The actual correlation coefficients 
between school and university marks varied from about +0.30 to 
+0.70, and the average of several such results was +0.48. In the 
light of our discussion of correlations (Chap. VII), this figure implies 
very poor predictive validity. 
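
How poor a prediction a coefficient of +0.48 affords may be gauged 
from two conventional indices. We assume here, since Chap. VII is not 
reproduced in this section, that the relevant measures are the 
proportion of variance accounted for, r squared, and the index of 
forecasting efficiency, 1 - sqrt(1 - r squared). The function names 
are illustrative. 

```python
import math


def variance_explained(r: float) -> float:
    """Proportion of variance in later marks accounted for by the earlier marks."""
    return r * r


def forecasting_efficiency(r: float) -> float:
    """Proportional reduction in the error of prediction, compared with guessing the mean."""
    return 1 - math.sqrt(1 - r * r)


# Lowest, average and highest coefficients quoted in the text.
for r in (0.30, 0.48, 0.70):
    print(r, round(variance_explained(r), 2), round(forecasting_efficiency(r), 2))
```

On these indices a coefficient of +0.48 accounts for less than a 
quarter of the variance in university marks, and reduces the error of 
prediction by only about one-eighth. 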

It should be remembered, however, that the above investigations 
(together with most of the studies of reliability) dealt with highly 
selected groups. At the 11+ selection stage practically the whole 
range of children is examined, and more reliable tests than the 
conventional subjectively marked papers are generally used nowadays. 
The result is that this examination gives correlations around 
+0.85 with performance in secondary school work 2 to 5 years later 
(Emmett and Wilmut, 1952). Moreover, in view of the imperfect 
reliability of the school marks used as a criterion of achievement, the 
true figure should be of the order of +0.90. Yet even this coefficient 
implies that many mistakes are made in selection. If 20 out of 100 
are admitted to grammar schools, 5 (that is, one-quarter) are likely 
to turn out unsuitable, and 5 who have been rejected and sent to 
modern schools would have surpassed them. 
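
The number of mistaken selections can be estimated as follows. The 
sketch assumes that the selection marks and the later school 
performance are jointly normal with a correlation of +0.90, and that 
the 'suitable' pupils are the best 20 per cent on later performance. 
Both assumptions are ours; the text gives only the resulting figures. 
The integral is evaluated by a simple midpoint rule. 

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def both_selected(rho: float, fraction: float, steps: int = 4000) -> float:
    """P(X > z and Y > z) for a standard bivariate normal with correlation rho,
    where z cuts off the top `fraction` of each variable."""
    z = STD.inv_cdf(1 - fraction)
    upper = 8.0  # beyond this the normal density is negligible
    width = (upper - z) / steps
    scale = (1 - rho * rho) ** 0.5  # SD of Y for a given X
    total = 0.0
    for i in range(steps):
        x = z + (i + 0.5) * width
        # Given X = x, Y is normal with mean rho * x and SD `scale`.
        total += STD.pdf(x) * (1 - STD.cdf((z - rho * x) / scale)) * width
    return total


admitted = 20
rightly_admitted = 100 * both_selected(0.90, admitted / 100)
wrongly_admitted = admitted - rightly_admitted
print(round(rightly_admitted, 1), round(wrongly_admitted, 1))
```

Under these assumptions about 15 of the 20 admitted pupils are rightly 
placed and about 5 are not, with an equal number of rejected pupils 
who would have done better. 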

It would be unreasonable to expect perfect correlations between 
marks at successive stages in pupils’ careers, since the subjects 
studied at a secondary school differ distinctly from those studied at 
a primary or preparatory school, even when called by the same 
names. And the conditions of work at a university differ from those 
at a school, being more congenial to some, less stimulating to 
others. Again, every individual changes to some extent as he grows 
older, some of his abilities becoming relatively weaker, new ones 
developing. All in all, then, the inability of entrance and scholarship 
examinations to choose those candidates who will profit most from 
higher education is not very surprising to the psychologist. Yet their 
validity for this purpose is certainly much poorer than is generally 
supposed by the educational authorities who set the examinations, 
and who base their awards and selections upon them. The universi- 
ties would scarcely continue to dictate the present constitution of the 
General and Leaving Certificates, and Matriculation examinations, 
if they did not believe that these instruments picked out the most 
suitable entrants. 

Still more unjustified is the belief embodied in the fourth of the 
functions outlined above (p. 196), namely that these examinations 
are adequate criteria of vocational fitness. It is indeed true, as
employers assert, that successful candidates will on the average be
better in general all-round ability than those who have failed to gain
the Certificates, since all scholastic abilities are to some extent
positively correlated with intelligence. It is probably true, also, that 
vocationally desirable temperamental and character qualities may 
aid in examination success. But so do a great many other factors. 
Some pupils with extremely undesirable characters may do well at 
examinations through intellectual brilliance; others through efficient
coaching on the part of their teachers, or through over-cramming;
still others through the pure chance factors introduced by the 
subjectivity of marking and other forms of unreliability. The con- 
clusion follows, then, that success or failure may depend on so 
many diverse qualities, that it does not provide trustworthy evidence 
of any of them. The employer would be far better advised to make 
use of the services of a scientific vocational psychologist, who would 
measure or obtain the best available assessments, both of intelligence 
and other aptitudes, of educational achievements and of tempera- 
mental traits, and would give each of these qualities its due weight 
in deciding on the suitability of prospective employees. Undoubtedly 
the adoption of a similar scientific study of each individual candidate 
for secondary or university education would lead to many more just 
selections, and more accurate predictions of future achievement, than
do the present methods (cf. Dale, 1954). 

Although this chapter has consisted mainly of destructive criticism, 
certain hints have been given as to possible ways and means of 
improving examinations. These will be followed up, and expanded 
in the next chapter and in Chap. XIV. 



CHAPTER XII 


RELATIVE ADVANTAGES OF NEW-TYPE AND 
ESSAY-TYPE EXAMINATIONS 

It has been widely claimed that the worst defects of the traditional 
type of written examination may be overcome by adopting in its 
place the objective or new-type test for measuring achievement. 
Such tests came into common use in the United States many years 
ago, but are still little known in this country, in spite of Ballard's 
(1923) able advocacy. Americans, however, realize now that they
also are liable to certain defects when used undiscriminatingly, and 
that there are still functions which can better be performed by the 
old- or essay-type of examination. Thus we are in a position to profit 
both from their researches and from their errors, to devise our tests 
in accordance with the principles they have already worked out, and 
to apply them to the fields of educational measurement where they 
will be most useful and most valid. 

Now it is clear that some educational products can be marked 
much more objectively and reliably than others. In the early stages 
of school arithmetic, when children are given a series of short sums
to do, there is no subjective element involved in deciding which ones,
and how many, they do correctly. But before long arithmetic becomes
more complex; each sum involves a number of different
operations, partial credits are essential, and different markers are
now liable to differ. An English composition, or an examination
answer written in essay form, brings in a still greater personal element,
since it is still more complex and cannot ever be considered
merely from the point of view of right or wrong. The new-type
tester argues, therefore, let us refrain from setting sums or essay
questions which involve many operations, and instead ask questions
which evoke each operation separately, questions to which only one
right answer is possible. And instead of leaving to the marker’s
personal taste the analysis which must be made of a complex product,
and the weighting which must be attached to its component elements,
let the person who sets the questions make the analysis
beforehand, and decide systematically just how many marks (i.e.
how many questions) shall be allotted to each element.

Advantages of New-type Examinations . — So we arrive at the new- 
type examination, which contains a large number of brief questions 
instead of a few long ones. Every question is so arranged that all 
markers will agree as to the rightness or wrongness of an answer. 
This elimination of the personal element is not the only advantage. 
For the hundred or more questions of the new-type test are capable
of sampling the whole field of knowledge much more compre- 
hensively than can the half-dozen or so questions of the orthodox 
examination. Candidates can no longer complain that the particular 
questions failed to suit them, since all their strong and weak points 
are probed. This fact should help to reduce the ‘spotting’ and cram- 
ming of likely questions. Again, the subjectivity of standards of 
marking need no longer cause trouble, since the examinees’ marks 
are pure ‘count scores’ (cf. p. 20). So long as the questions are designed
to cover all levels of difficulty evenly, the distribution of marks
is sure to approximate to normality, and can readily be converted 
into any desired percentage or other scale. The examiner may, of 
course, draw an arbitrary pass line at any point in the distribution 
that he wishes, so as to cut off the examinees who fail to come up to
his minimum standards. 

A full description of the various sorts of questions which may be 
employed in new-type tests is given in the next chapter. Here it will 
be sufficient to distinguish between the so-called ‘open’ or ‘recall’ 
type of question and the ‘closed’ or ‘recognition’ type. In the former 
the desired answer consists of a single word, or a few words, or 
numbers, which the examinee must write either in a space provided 
on the examination paper itself, or on a separate sheet. Questions 
1-16 on pp. 213-14 are of this type. Since, however, there is always
some danger of partially right or alternative answers being given, 
whose marking would involve subjective judgment, the other type of 
question is more commonly adopted, where the examinee has to 
identify the right answer from among two or more answers printed 
on the examination paper. Either he may underline or put a check 
beside the answer he believes to be correct, or else merely write 
down its letter or number. Questions 17-54 on pp. 215-19 are of this 
type. 

This latter system has the additional advantage that practically all 
writing is eliminated. So that when geography or history is being 
examined, irrelevant factors such as speed of writing or literary style
do not distort the examiner’s marking. It has been claimed that the
average examinee, instead of expressing less than one idea or fact 
per minute (as in an essay-paper), can deal with three to six new-type 
questions a minute. But this naturally depends on the level of diffi- 
culty of the questions. A further advantage is that the examinee who 
knows scarcely anything about the subject, but who bluffs by ‘free- 
associating’ around an essay-type question, and who often gains 
marks by doing so, is entirely foiled. Older students realize how much 
fairer is the new-type test, and on the whole prefer it to the orthodox 
examination. Younger pupils, also, according to several American 
investigations, enjoy it more. 

A short test is appended below, dealing with the subject-matter of 
the previous chapters in this book, which the reader is advised to 
try to answer. It is not meant to be a complete examination on 
mental testing and statistics (hence it violates several of the rules of 
construction given in the next chapter). But it introduces several of 
the main varieties of new-type questions, some easy, some fairly 
difficult, and should give the reader who is not already familiar with 
such questions a better insight into them. Nos. 1-8 are simple-recall
type, Nos. 9-16 completion type, Nos. 17-25 true-false type, Nos. 
26-36 multiple-choice type, Nos. 37-47 matching type, and Nos. 
48-54 rearrangement type. The correct answers are listed on
p. 258. 

EXAMINATION ON MENTAL MEASUREMENT AND STATISTICS 
Time: One Hour 

Please read the following directions carefully before starting: 

Answer as many questions as possible, in any order. The answers 
should be written in the right-hand margin on the blank spaces 
provided. Any rough figuring may be done on blank paper.

The answer to every question (except Nos. 42-7) consists of one 
letter, word, or number. 

You may guess if you are not certain of an answer; but as marks
will be deducted for wrong answers, random guessing is not 
advisable. 

1. Eight pupils obtained the following school marks:

    A    B    C    D    E    F    G    H
   15   10   12   10    9   12   10   18

What is their median? . . . .

2. If the above marks were put into rank order form, what would
B’s rank position be? . . . .

3-5. The following is an extract from the norms for a group intelligence
test:

   Score . . . .   48   58   68   77   84   89   93
   Mental Age  .    9   10   11   12   13   14   15 and adult

What are the respective intelligence quotients of:

3. A child aged 8.0 who scores 77? . .

4. A child aged 10.0 who scores 63? .

5. An adult aged 20.0 who scores 77?

6-8. In a certain secondary school all the marks are transposed into
the following standard scale. A pupil obtains marks of 60%,
47%, and 63% respectively in English, French, and Mathematics.
In his English class there are 41 pupils, in his French
class 35, and in his Mathematics class 44.


   Percentile     Mark
       90 . . .   73%
       75 . . .   67%
       50 . . .   60%
       25 . . .   53%
       10 . . .   47%


What is his approximate place (i.e. his rank position in the
class) in : 

6. English? . . . . . . 

7. French? . . . . . . 

8. Mathematics? . . . . . 

9-16. In each of the numbered spaces in the following paragraph, 

one word or letter is needed to complete the statements. Do not 
write anything in the actual spaces, but opposite No. 9 in the 
right-hand margin write the word or letter which is needed to 
fill up space (9). Fill in the other words or letters similarly. 

The compensation theory of mental organization
is disproved by the fact that all correlations        9
between mental abilities are (9). The results
of correlational investigations also disprove        10
the view that the mind is made up of distinct
(10). The Two-factor Theory, proposed by (11),       11
regards each ability as made up of a factor
common to all abilities, called (12), and a          12
factor (13) to that ability alone, called s.
However, when verbal tests, performance tests,       13
and other tests of intelligence are compared,
it is doubtful whether (14) will account for         14
all the inter-correlations. Residual
correlations indicate the presence of (15)           15
factors, such as verbal ability, v, and
practical ability (16).                              16

17-25. Write a + sign in the margin after any of the following
statements which are true, and a − sign after those which are
false.

17. An omnibus test is a group intelligence test
    in which items of many kinds are mixed up

18. A majority of very dull children are below
    average in health and physical development . . . . .

19. Performance tests are applied by psychologists
    chiefly in order to measure a child’s
    practical ability in everyday life .

20. In order to measure the average intelligence
    of a class of 7-year-old children, a teacher
    would be best advised to use a group test
    with pictorial material . . .

21. In a normal frequency distribution curve the
    semi-interquartile range and the standard
    deviation are identical .

22. University students who obtain scholarships
    on the basis of competitive examinations
    as often as not obtain no better degree
    marks than those who fail to win such
    scholarships . . . . .

23. The advantage of scoring abilities in terms
    of Mental Age or Educational Age is that
    the units of measurement are all equal to
    one another . . .

24. A group of a hundred children picked out as
    being especially superior in one school
    subject will practically never show as great
    an amount of average superiority in any
    other subject . . . .

25. The Probable Error of a pupil’s English composition
    mark is generally likely to be
    smaller than that of his arithmetic mark

26-36. Write the letter (A, B, or C, etc.) of the correct answer in the
margin. 

26. Group intelligence tests, as contrasted with
    the Stanford-Binet test, are: . .

    (A) More varied in content
    (B) Better measures of children's intelligence
    (C) More objectively scored
    (D) More difficult to apply
    (E) Less suitable for application to superior
        adults

27-32. A set of 51 pupils took three examinations, A,
B, and C. The following distributions of marks
were obtained:

    Mark        A     B     C
    87 + .      1     1     3
    80 + .      3     2     5
    73 + .      4     4     7
    66 + .     18     7    11
    59 + .     17    11    11
    52 + .      5    14     8
    45 + .      2     9     4
    38 + .      1     3     2

               51    51    51


27. Which of the distributions approximates
    fairly closely to normal, A, B, C, or
    None? . . . . . .

28. Which of the distributions is negatively
    skewed, A, B, C, or None?

29. Which of the distributions is likely to
    have the smallest standard deviation?

30. A certain pupil happened to get 70% on
    all three examinations. In which of
    them did he do relatively best?

31. If you were calculating the average of
    Distribution C, what figure would you
    take as your arbitrary mean? .

32. In calculating this same average, what
    figure would be multiplied by 2 (i.e.
    by the number of pupils who score
    38 +)?

33. A group intelligence test was applied in a 

secondary school, and the occupational 
preferences of all the pupils were ascer- 
tained and classified under ten headings. 
In order to find whether there is any cor- 
respondence between occupational prefer- 
ence and intelligence, which of the follow- 
ing statistical techniques could best be 
applied? ...... 

(A) Biserial r. 

(B) Product moment correlation. 

(C) Analysis of variance. 

(D) Contingency coefficient. 

(E) Correlation ratio. 

(F) Multiple correlation. 

34. Which is the least serious of the following 

defects in intelligence quotients obtained 
from group intelligence tests? . 




(A) Group test scores are considerably affec- 

ted by previous practice on similar tests. 

(B) The standard deviation of group test 

I.Q.s differs considerably in different 
tests. 

(C) Few of the published tests are adequately 

standardized, or have reliable norms. 

(D) Group tests are generally given with a 

time limit, and the child who is quick- 
est is not necessarily the most intelli- 
gent. 

35. Which one of the following statements is un- 

true? The distribution of marks given by 
an examiner to a class of fifty pupils may 
be very irregular or skewed because : 

(A) The number of cases is too small for a

smooth, normal distribution to be 
expected. 

(B) The examiner’s opinion of the relative 

merits of the scripts is a personal one. 

(C) The pupils’ abilities may be asymmetric- 

ally distributed, as when, for example, 
the class contains a long tail. 

(D) The scale of marks which the examiner 

adopts may be lacking in any rational 
basis. 

36. In an experiment on the scores of graduate 

and non-graduate teachers on verbal and 
non-verbal group intelligence tests, these 
results were obtained: 

                              Verbal Tests   Non-verbal Tests
    Graduates . . . . . . .      153.9             78.6
    Non-graduates . . . . .      143.5             77.0
    P.E. of the difference
      between means . . . .        2.1              1.2

Which of the following conclusions may legi- 
timately be drawn? .... 

(A) Graduates do better than non-graduates 

at intelligence tests. 

(B) All teachers do better at verbal than at 

non-verbal tests. 

(C) Scores on verbal tests are improved by 

taking a university degree. 

(D) Teachers with university training have 

no appreciable advantage over others 
on non-verbal tests.

37-41. The above pairs of curves (A-D) represent the approximate
frequency distributions of the intelligence quotients of various
groups of children. Below are listed several such groups.
Opposite four of them write the letter of the pair of curves 
which corresponds. Write ‘None’ opposite the groups to which 
no pair of curves corresponds. 

37. One thousand elementary school pupils, and 

one thousand special school pupils 

38. All elementary school pupils and all special 

school pupils in the country 

39. All elementary school pupils and all children 

in the country . . . . 

40. All special school and all secondary school 

pupils in the country 

41. All elementary school and all secondary 

school pupils in the country . . 

42-47. Below is given a list of psychologists (A-F), together with a
list of important pieces of work in educational psychology.
After each of the latter write the letter of the psychologist who
was mainly responsible for, or whom you chiefly associate with,
this piece of work. Two letters may be written if two psychologists
were prominently associated. Some letters may be used
twice or more, some not at all.

    (A) Ballard
    (B) Binet
    (C) Burt
    (D) Drever-Collins
    (E) Terman
    (F) Thurstone

42. Development of the conception of the Intelligence Quotient .

43. Development of a scale of performance tests

44. Investigations by correlations into the factors
    underlying abilities

45. Advocacy of new-type examinations . .

46. Construction of educational attainments tests

47. Re-standardization of the Binet-Simon scale .

48-54. An arithmetic test, a handwriting test, and two group
intelligence tests were given to an ordinary school class, and
the following correlations between test scores were worked out:

    (A) Combined intelligence tests with handwriting test.
    (B) Combined intelligence tests with arithmetic test.
    (C) Combined intelligence tests with chronological age.
    (D) One intelligence test with the other.
    (E) Arithmetic test with handwriting test.
    (F) Handwriting test with chronological age.
    (G) One intelligence test with arithmetic test.

Which pair would you expect to yield:

48. The highest correlation?

49. The second highest? . . . .

50. The third? . .

51. The fourth? . . . .

52. The fifth? . .

53. The sixth? . .

54. The lowest correlation? .


DEFECTS OF NEW-TYPE EXAMINATIONS

So far only the advantages of new-type examinations have been
mentioned, and we must now consider criticisms of them in detail.
Two minor objections will be admitted at once. First, the cost of
printing or duplicating a new-type test is considerably greater than
that of an essay-type examination. For classroom purposes, however,
oral presentation of the questions can often be adopted. If several
groups of pupils or students are to be examined, they can write their
answers on blank sheets, and their test papers can be used several
times over.

Secondly, much greater time and skill are needed for constructing 
a good new-type test. The major criticisms which are to be discussed 
below apply only too often to inferior tests which have been rapidly 
constructed by inexperienced amateurs. It is highly desirable that 
teachers should be trained in the principles of testing during their 
training college courses. The time needed in setting a paper is, how- 
ever, more than offset by the time saved in marking, when the num- 
ber of examinees is large. Indeed, since the marking should be fool- 
proof, there is no reason why it should not be done by clerks or 
secretaries. In large public examinations, the authorities could save 
much expense by paying professional examiners only to set, not to 
correct, the papers. The present writer finds that the construction of
a one-hour’s new-type test in educational psychology (such as that
appended above) takes him about twenty hours. Setting an essay-examination
scarcely needs one hour. But in marking these one-hour
examinations he can easily do thirty of the former as against
ten or less of the latter scripts per hour. He, therefore, finds no saving
in total hours of work unless the number of examinees approximates
to 300 or more. But the setting of a new-type test is a fascinating
occupation, which can be done in odd moments throughout the
year; and the marking is simply a routine matter which involves no
mental strain. By contrast, the marking of large numbers of essay-type
scripts in psychology is the most trying work that he ever has to do.
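
The break-even point of about 300 examinees follows from the figures just quoted. The short sketch below reworks the arithmetic; it takes the essay marking rate as exactly ten scripts per hour (the text says "ten or less"), and the names used are illustrative only.

```python
from fractions import Fraction

# Figures quoted in the text: hours needed to set the paper, and the
# number of one-hour scripts that can be marked per hour.
NEW_TYPE = {"setting_hours": 20, "scripts_per_hour": 30}
ESSAY = {"setting_hours": 1, "scripts_per_hour": 10}


def total_hours(exam, examinees):
    """Setting time plus marking time for a given number of examinees."""
    return exam["setting_hours"] + Fraction(examinees, exam["scripts_per_hour"])


def break_even_examinees(new=NEW_TYPE, old=ESSAY):
    """Number of examinees at which both examinations cost the same time."""
    extra_setting = new["setting_hours"] - old["setting_hours"]
    saving_per_script = (Fraction(1, old["scripts_per_hour"])
                         - Fraction(1, new["scripts_per_hour"]))
    return extra_setting / saving_per_script


if __name__ == "__main__":
    n = break_even_examinees()
    print("Break-even number of examinees:", n)
    print("Hours at that point:", total_hours(NEW_TYPE, n), total_hours(ESSAY, n))
```

The extra 19 hours of setting are recovered at a rate of one-fifteenth of an hour per script, so the two methods cost the same at 285 examinees; if essay scripts are marked more slowly than ten per hour, the new-type test pays for itself sooner.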

The Guessing Factor . — We must admit that examinees may
obtain a certain number of correct answers (in a recognition-type
test) by pure chance guessing. If they filled in answers at random
they would, on the average, achieve half-marks on the questions
where two responses are provided (such as the true-false type), and
quarter-marks on the questions where four responses are provided.
This fact disturbs the teacher far more than it does the psychologist.
For the latter realizes that every examinee has the same chance of
scoring, let us say, 35%, by random guessing; he therefore regards
35% as the effective zero point and only gives credit to scores above
this level. An alternative procedure, which comes to much the same
thing, but is fairer in that it does not penalize the examinee who omits
questions rather than guess them, is to subtract the total wrong
responses in true-false questions from the total right ones; to subtract
half the total wrong in three-response questions, one-third in
four-response questions, and so on. That is to say, an examinee’s
score corrected for guessing $= R - \frac{W}{n-1}$, where R is the total right,
W the total wrong, and n is the number of alternative responses
provided in each question. In general practice the correction is
omitted in four- or more response questions, since it is so small.*
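
The correction can be written as a one-line function. The following is a minimal sketch of the formula R − W/(n − 1) described above; the function names are illustrative, and exact fractions are used so that thirds are not rounded.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_responses):
    """Score corrected for guessing: R - W / (n - 1).

    Omitted questions count neither as right nor as wrong.
    """
    if n_responses < 2:
        raise ValueError("a recognition item needs at least two responses")
    return right - Fraction(wrong, n_responses - 1)


def expected_score_from_pure_guessing(n_items, n_responses):
    """Average corrected score of an examinee who guesses every item."""
    expected_right = Fraction(n_items, n_responses)
    expected_wrong = n_items - expected_right
    return expected_right - expected_wrong / (n_responses - 1)


if __name__ == "__main__":
    # 30 right and 10 wrong, with the remaining questions omitted.
    for n in (2, 3, 4):
        print(n, "responses:", corrected_score(30, 10, n))
```

An examinee with 30 right and 10 wrong is thus credited with 20 on true-false items and 25 on three-response items, while the average corrected score obtained by guessing every item is zero whatever the number of responses.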

But the problem of guessing is a complex one which cannot be 
entirely settled by this mechanical correction. It has aroused wide- 
spread controversy, and a long discussion of it may be found in Ruch 
(1929). Although the above correction will compensate for the 
effects of guessing in the average examinee, yet it will not do so in 
all cases, since some will be luckier than others, in accordance with 
the principles of probability. Suppose that a large group of persons 
guesses the responses to ten true-false items ; the average person will 
get 5 of them right, and two-thirds of the group will get either 4, 5,
or 6 right. Only about one person in a thousand will be so lucky as
to get 10 right, or so unlucky as to hit on none of the right answers.
If the guessing correction is applied, the scores corresponding to 10,
6, 5, 4, and 0 right will be: +10, +2, 0, −2, and −10 respectively.
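
These proportions are those of the binomial distribution and may be verified directly. The sketch below assumes pure guessing with an equal chance on every response; the function name is illustrative.

```python
from fractions import Fraction
from math import comb


def chance_distribution(n_items=10, n_responses=2):
    """Probability of each number right when every item is guessed."""
    p = Fraction(1, n_responses)
    return {k: comb(n_items, k) * p**k * (1 - p)**(n_items - k)
            for k in range(n_items + 1)}


if __name__ == "__main__":
    dist = chance_distribution()
    print("P(4, 5 or 6 right) =", dist[4] + dist[5] + dist[6])  # 21/32
    print("P(all 10 right)    =", dist[10])                     # 1/1024
    for right in (10, 6, 5, 4, 0):
        wrong = 10 - right
        # Corrected true-false score is right minus wrong.
        print(right, "right gives corrected score", right - wrong)
```

The chance of 4, 5, or 6 right is 21/32, or about two-thirds, and the chance of all ten right is 1 in 1,024.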

In a longer test the range of chance scores which may arise from 
guessing will be still greater, although their average (when corrected) 
will still be zero. Undoubtedly, then, we must expect true-false tests 
to be somewhat unreliable, but the effect will be less marked in tests
composed of three- or four-response items. Experimental evidence
does indeed show true-false tests to be less reliable than multiple-choice
tests; and the latter are less reliable than tests made up of
recall items. But the difference between them is not large, and it may
easily be offset by the fact that a true-false test can contain more items 
than a multiple-choice test which occupies the same length of time, 
since true-false items can be answered more quickly. Probably a 
good deal depends, also, on the skill of the tester in setting each 
particular type of question. 
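
The way in which extra items can offset lower reliability may be illustrated with the standard Spearman-Brown formula, which the text does not itself invoke at this point. In the sketch below the reliabilities of 0.70 and 0.80, and the supposition that true-false items are answered twice as fast, are hypothetical figures chosen only for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Hypothetical figures: a true-false test of reliability 0.70 is compared
    # with a multiple-choice test of reliability 0.80 taking the same time.
    true_false, multiple_choice = 0.70, 0.80
    # If twice as many true-false items fit into the same time:
    lengthened = spearman_brown(true_false, 2)
    print(round(lengthened, 3), lengthened > multiple_choice)
```

On these assumed figures the doubled true-false test would rise above the multiple-choice test in reliability, which is the kind of offsetting the paragraph describes.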

* Note that there is no point in making any guessing correction when all
examinees attempt an answer to every question. Their relative marks will be
affected by the correction only when they omit some questions. In general, then,
correction becomes less necessary the longer the time allowance for the examination.

† The situation is effectively the same as getting the persons each to toss a
penny ten times, and counting up how many of them obtain 0, 1, 2, . . ., 9, 10
heads.

Now, though we must allow that a considerable error may occur
among the scores of a small proportion of examinees on account of
guessing, in actual practice entirely random guessing (which was
assumed in the previous paragraph) will very seldom occur. The
examinee is more likely to possess an incomplete knowledge of 
certain items, sufficient, however, to make his guesses right ones. 
Teachers may regard it as unfair that such an examinee will obtain 
the same marks on those items as will the more able examinee who 
really knows the right answers. But a brief study of recognition-type 
questions, such as those given above, will show that the examinee’s 
shaky knowledge may not always be to his advantage. For the 
alternative (wrong) answers are usually worded sufficiently plausibly 
to deceive the weak student, and should indeed be based on common 
misconceptions. Hence his ‘semi-guesses’ will in practice be very 
frequently wrong. For this reason experimental researches indicate 
that the guessing correction fully compensates, or even over- 
compensates, for any advantages which the weak student might gain 
by guessing. Other investigations prove that the amount of guessing 
may be greatly reduced, and the reliability and validity of the test 
slightly increased, by instructing examinees not to guess, and by 
warning them that wrong guesses will be penalized. 

To sum up: random guessing may considerably distort the marks 
on a recognition-type test, especially when the questions provide 
only two alternative answers. It does not actually do so to any great 
extent, partly because guessing scarcely ever is random, partly be- 
cause it can be reduced by appropriate instructions, and partly because 
such tests can include large numbers of items which make for increased 
reliability. If the recognition-type of item is to be condemned, it must 
be primarily on account of some of the other reasons discussed below. 

Suggestive Effects of Wrong Responses . — Another common 
criticism which we can refute is that the presentation of false state- 
ments and wrong answers to the suggestible minds of pupils is 
pedagogically unsound. It is alleged that they may absorb such state- 
ments and later on believe them to be correct. This theory is an a 
priori one, and experiments which have been devised to try it out 
have always failed to provide confirmation. Probably the fact that 
examinees know that about half the true-false statements, and more 
of the multiple-choice responses, are wrong is sufficient to offset the 
effects of suggestion. Many teachers find that there is considerable 
pedagogical value in getting pupils and students to correct their own 
papers shortly after the test, so letting them see which responses are 
wrong (cf. Ballard, 1923).

The Elementaristic Tendency in New-type Tests . — We come now to
the most obvious and most frequent objection to new-type tests. It 
is said that they measure only the most trivial aspects of ability, such 
as details of information. The tests may indeed analyse complex 
mental operations into elements which can be objectively marked, 
but they cannot put the elements together again, and in the process 
of analysis they omit many of the most important aspects of ability. 
They cannot, as can the essay-type of examination, show the ex- 
aminee’s general understanding of the subject, nor his interpretation 
of facts, his capacity for organizing and formulating his knowledge, 
nor his initiative and originality. Far from reducing the tendency of 
examinees to cram mechanically, they may increase the amount of 
rote memorization and discourage any desire to achieve a general 
grasp of the field. Investigation has, indeed, shown that students revise 
their work in a different manner when they know they are to sit a
new-type examination, and that they concentrate on acquiring such
bits of knowledge as are most likely to appear in the test questions.
It is often alleged, though without proof, that secondary pupils who
have been trained in the primary schools to pass objective selection
tests find considerable difficulties in adjusting to more normal
forms of school learning. 

Such criticisms embody a number of half-truths, and may be 
countered by a number of arguments. The retort may first be made 
that even if the new-type test cannot measure ‘understanding, 
originality, etc.,’ the essay examination cannot do so either. For 
when examiners come to assess such qualities, they disagree widely 
with one another. The orthodox examination achieves a reasonable 
degree of reliability when it most closely approximates to a new-type
test, i.e. when it asks primarily for factual data, or when the examiners
draw up a detailed analytic scheme of the points which they
wish to mark. Secondly, it may be pointed out that examinees en- 
gaged on an essay examination do not usually shine in regard to 
organization of knowledge and original thinking. Rather they dash 
down all the ideas and facts which they can recall, many of them 
irrelevant. ‘Thinking’ as contrasted with ‘reproduction of items of 
information’ is at a minimum. The new-type test taps all the relevant 
knowledge, with much less waste of time, and prevents the candidates 
from producing irrelevant material. 

Such recriminations are also, however, only partially justified.
Much sounder is the argument which points to weaknesses in the
critics’ psychological analysis of the contrasted examinations. They
commonly draw a sharp distinction between information and repro- 
duction, on the one hand, and understanding and thinking, on the 
other hand. They assume, that is, that these mental processes are 
uncorrelated. In actual practice the two cannot be separated so 
readily. The acquisition and reproduction of information always 
involves a certain amount of thinking, and no thinking is possible 
unless a person possesses information to think about. Hence, 
experiments in which attempts have been made to measure these 
faculties separately have usually shown them to be highly inter- 
correlated. It follows, therefore, that even when a new-type test is 
apparently measuring nothing but information, it is at the same time 
providing a pretty good measure of other more complex types of 
ability. It cannot ask examinees to ‘Discuss . . ., Compare . . ., etc.,’
nor make any overt provision for independent thinking, yet it is 
certainly drawing to a considerable extent upon these ‘higher’ mental 
processes which the orthodox examiner desires to assess. Conversely, 
a large proportion of the average essay-type answer consists of the 
sort of material which critics affect to despise when it is accurately 
measured by new-type tests. 

The overlapping between the two kinds of mental processes is 
further demonstrated by the fact that a new-type response, or a sen- 
tence in an essay answer, may depend upon ‘understanding’ in one 
examinee, but may result from ‘facile reproduction’ in another 
examinee, according to the respective methods by which these 
examinees have studied the topic in question. Nevertheless, it seems 
legitimate to make a partial distinction between the two types, and 
to admit that examinees have more opportunity of displaying ‘higher’ 
mental processes in essay-type than in most new-type examinations. 
And we must remember that, even if they do not usually write good 
English in their essay papers, yet they would probably never learn 
to write English at all if new-type examinations became the sole 
method of testing their abilities. This does indeed tend to occur 
among primary pupils in education areas where the selection ex- 
aminations are purely objective, and some authorities have re- 
introduced the composition examination in order to counter it. 

The reason why the criticisms of triviality and fragmentariness 
seem at first sight so justifiable and so serious is because many new- 
type tests are incompetently constructed. Questions which ask for 
bits of information are much the easiest to set, and, therefore, do 
only too frequently constitute the bulk of home-made examinations. 
But there is nothing inherent in new-type examining which makes 
better questions unattainable. And the writer would claim that at 
least half the questions in his own test, above, do demand not only 
the possession of information, but also the application of principles 
or the interpretation of information; that is to say, ‘understanding.’ 

To conclude, then, the defect which we have been discussing is one 
which may occur in new-type tests, but need not do so to any great 
extent. Its effect upon the validity of the tests has been greatly ex- 
aggerated by critics owing to their faulty views of the nature and 
organization of mental abilities. 

Defects Due to the New-type Medium of Expression . — Our next 
objection is of a different order. It does not try to point out some- 
thing which new-type tests fail to measure, but rather to show that 
they measure a good deal over and above the abilities which we 
actually want to assess; and that this extra factor, having little 
educational significance, seriously distorts and reduces the validity 
of all new-type test results (cf. pp. 144, 168). Some of the most ardent 
advocates of new-type testing, e.g. Ballard (1923), do not seem to 
have taken this criticism into account. An excellent account of it may 
be found in Hawkes and Lindquist et al. (1937). 

If the reader will analyse carefully the processes which take place 
in his own mind when answering the new-type questions above, he is 
likely to find a good deal going on besides the recollection and 
application of his knowledge of the subject. He must first study the 
instructions to discover what he has to do in each question, and then 
manipulate his knowledge in such a way as to fit it in with the require- 
ments of the question. In a multiple-choice question he may arrive 
at the right response, not because he is certain that he knows it, but 
because he has eliminated the incorrect ones. Sometimes a single 
word in the question or in the provided response will enable him to 
choose correctly, without his having to think out the full implications 
of all that is printed. Some questions may, unknown to the author, 
contain clues to the correct responses which do not require any 
knowledge whatsoever of the subject. But if the reader is a sufficiently 
sophisticated new-type examinee, he will at once take advantage of 
such faults in construction. Examples of these irrelevant clues are 
given in the next chapter. Unfortunately, the questions which are 
most liable to contain such errors are those that aim to evoke 
‘understanding’ rather than ‘reproduction.’ Straightforward questions 
which ask primarily for factual information (e.g. Nos. 27-33 
and 42-47) are generally free from them. But more elaborate ques- 
tions (e.g. Nos. 34-41, 48-54) involve a great deal of ‘disentangling’ 
and complex inference. The ability to do such questions may depend 
rather largely on the examinee’s previous experience and practice 
with such tests, or upon some unknown group factor, and not 
solely on his knowledge of the subject being tested. 

Such sources of unreliability and invalidity may be much reduced 
if the examiner is sufficiently skilled in anticipating all the diverse 
mental processes which examinees may employ, but they are likely 
to be eliminated only if he falls back on questions of the simple- 
recall type, and asks merely for scraps of information. Actually no 
examiner can foresee all the possible weaknesses in his questions, for 
investigation shows that even tests constructed by experts may con- 
tain items whose validity is very low, or sometimes negative. That is, 
these items are done no better, and sometimes worse, by good than 
by poor students. A difficult multiple-choice item, for example, may 
contain wrong alternatives which are sufficiently plausible to deceive 
the majority of good students, but which are beyond the compre- 
hension of the poor students, who consequently just guess, and often 
guess rightly. Or the correct answer may contain some unsuspected 
ambiguity which is not noticed by mediocre students, who therefore 
choose it, but which is noticed by good students, who therefore reject 
it. Such faults in construction are still more likely to occur in tests 
devised by amateurs. 
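
The idea of an item with low or negative validity can be illustrated by a small calculation. What follows is only a minimal sketch in Python, not a procedure taken from the text: it assumes that every answer is marked 1 (right) or 0 (wrong), splits the examinees into an upper and a lower half by total mark, and reports for each item the pass rate of the upper half minus that of the lower half. The name `item_discrimination` and the sample marks are hypothetical.

```python
def item_discrimination(responses):
    """For each item, return (pass rate in upper half) - (pass rate in lower half).

    responses: one list per examinee, holding a 0/1 mark for every item.
    A value near zero means the item separates good from poor examinees
    no better than chance; a negative value means poor examinees did better.
    """
    if len(responses) < 2:
        raise ValueError("at least two examinees are needed")
    totals = [sum(marks) for marks in responses]
    # Rank examinees from lowest to highest total mark.
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    half = len(order) // 2          # with an odd number, the middle examinee is left out
    lower, upper = order[:half], order[-half:]
    indices = []
    for item in range(len(responses[0])):
        p_upper = sum(responses[i][item] for i in upper) / half
        p_lower = sum(responses[i][item] for i in lower) / half
        indices.append(p_upper - p_lower)
    return indices


# Invented marks for four examinees on four items.
sample = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
print(item_discrimination(sample))
```

In this invented sample the last item is passed only by the two weakest examinees, so its index is negative; an item of that kind would be a candidate for rejection or rewording.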

It was shown in the previous chapter that many of the defects of 
the orthodox examination derived from the fact that examinees have 
to formulate or express their knowledge and ability in a distorting 
medium, namely essays, and that examiners disagree in their inter- 
pretation and evaluation of these complex mental products. We see 
now that much the same is true of new-type examinations. Examinees 
have to express their knowledge and ability in a different but perhaps 
still more artificial medium. This ‘roundaboutness,’ and the com- 
plexity of the mental processes involved in arriving at their answers, 
undoubtedly reduce the validity of their marks. In examining average 
and dull pupils or adults it is not correct to claim that the new-type 
examination is unaffected by their poor literacy. For though writing 
is largely eliminated, the examinee has far more reading to do. 

The Subjectivity of New-type Tests . — A final defect in new-type 
tests, which has been entirely neglected by the majority of writers, is 
that they are, in a sense, just as subjective as essay-type examinations. 
Only the personal element enters, not in the marking, but in the 
setting of the questions. Pullias (1937) has demonstrated that when 
two or more examiners construct their own new-type test papers, 
intending to cover precisely the same field of knowledge and ability, 
and apply them to the same pupils or students, the average correla- 
tion between the two sets of marks is only of the order of +0.50. 
This figure may be unduly low because the examiners were teachers 
who may not have been highly skilled in test construction. Yet some 
thirty-five different teachers took part in the experiment, all of whom 
habitually employed new-type tests, and the results were very similar 
in different school subjects, and among pupils of various ages. Pullias 
further found that standardized educational tests, which were alleged 
to measure ability in the same school subjects, only yielded an 
average inter-correlation of +0.68. In some instances the co- 
efficients were decidedly higher (+0.8 to +0.9), in others very much 
lower (+0.2 to +0.3). 
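
Coefficients of this kind are product-moment correlations between two sets of marks awarded to the same examinees. The following is a minimal sketch of the calculation in Python; the name `correlation` is hypothetical and the marks are wholly invented, not Pullias's data.

```python
import math


def correlation(marks_a, marks_b):
    """Product-moment (Pearson) correlation between two equal-length lists of marks."""
    if len(marks_a) != len(marks_b) or len(marks_a) < 2:
        raise ValueError("need two lists of equal length with at least two marks")
    n = len(marks_a)
    mean_a = sum(marks_a) / n
    mean_b = sum(marks_b) / n
    # Sums of squared deviations and of cross-products about the means.
    ss_a = sum((a - mean_a) ** 2 for a in marks_a)
    ss_b = sum((b - mean_b) ** 2 for b in marks_b)
    sp = sum((a - mean_a) * (b - mean_b) for a, b in zip(marks_a, marks_b))
    return sp / math.sqrt(ss_a * ss_b)


# Invented marks of six pupils on two teachers' papers in the same subject.
paper_a = [12, 15, 9, 20, 17, 11]
paper_b = [14, 10, 11, 18, 13, 16]
print(round(correlation(paper_a, paper_b), 2))
```

A coefficient near +1 would mean that the two papers rank the pupils almost identically; one near +0.5 means that a pupil's standing depends to a considerable extent on which examiner happened to set the paper.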

How may we account for such results? Actually the sources of un- 
reliability described on pp. 200-7 apply here, too. These were: 

1. Actual changes in the examinees, or function fluctuation. 

2. Inadequate sampling. As already mentioned (p. 212), this 
should be much less prominent in new- than in essay-type examina- 
tions, especially if the instructions given at the beginning of the next 
chapter are followed. The trouble is not so much lack of thorough- 
ness of the questions as the subjectivity of their selection. 

3. Statistical defects. These can arise, not because different exam- 
iners adopt different scales of marks, but because the average level, 
and the range, of difficulty among their questions may differ. Such 
inconsistencies can, however, readily be eliminated, and they did not 
affect Pullias’s correlations. 

4. Subjectivity. The essay-examination is subjective because the 
examiner has to analyse the field of ability and decide which aspects 
are most important, and his personal opinion determines which 
characteristics of the examinees shall be assessed as good or bad. The 
new-type examination is subjective because the examiner has to per- 
form a similar process of analysis in choosing questions, and in pro- 
viding the good and bad answers between which the examinees are 
to discriminate. Again, the essay-examiner is unable to maintain 
objectivity because he must translate the answers presented to him, 
or ‘penetrate through the medium of expression’ in order to discover 
the examinees’ real merit. The new-type examiner carries out an 
equivalent translation when formulating his questions. Hence differ- 
ent examiners will formulate questions about a given topic differ- 
ently, in just the same way that different essay-examiners will trans- 
late differently the answers to an essay-question on this topic. 

Pullias points out that the main stages in examining, namely (a) con- 
structing questions, (b) answering questions, and (c) evaluating answers, 
in reality constitute a single whole. The new-type examiner has, in- 
deed, eliminated subjectivity from (c), but at the cost of greater sub- 
jectivity in (a). And his personal opinion has inserted itself into (b), 
since his questions embody a good deal of what, in essay-examina- 
tions, is normally carried out by the examinees. 

Now that we recognize this subjectivity as the chief flaw in new- 
type tests, we should be able to control and reduce it more readily 
than in essay-examining. Nevertheless, we must certainly conclude 
that both types of examination are imperfect mental measuring in- 
struments, and open to much the same objections. Conclusive 
evidence as to their relative validity is not easy to obtain, since usually 
it is only possible to compare their marks with the marks on sub- 
sequent, equally fallible, examinations. But by pooling the marks 
from several papers of both types, from class teachers’ estimates, etc., 
a fairly good criterion may be achieved. And according to this, 
neither the essay-type nor the new-type is definitely superior (cf. 
Lee and Symonds, 1933). The evidence of the superior validity 
of new-type tests which Ballard (1923) puts forward is quite uncon- 
vincing. In secondary school selection conventional, subjectively 
marked examinations are certainly inferior to objective tests. But 
when the marking is standardized there is little to choose; sometimes 
the old type, sometimes the new, show better correlations with later 
secondary school achievement. Certainly there is no proof that the 
application of new-type tests in university entrance or scholarship 
examinations would give appreciably better or poorer predictions of 
ultimate university success than at present. Since, however, the errors 
which reduce the validities of the two are different, it follows that 
they measure somewhat different aspects of ability. Hence by com- 
bining them the respective errors partially cancel one another out, 
and the total validity is definitely superior to that of either of the 
separate types. And as certain features of a scholastic subject may be 
more readily tested by the one, other features by the other, each 
should be employed in the spheres to which it is best adapted. 
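
The gain from combining the two types can be illustrated with a standard result: if two sets of marks are put on a common scale and added with equal weight, the correlation of the sum with a criterion is (r1 + r2) / sqrt(2 + 2·r12), where r1 and r2 are the separate validities and r12 is the correlation between the two sets of marks. The sketch below, in Python, merely evaluates this formula; the name `composite_validity` and all the figures are invented for illustration and are not taken from the studies cited.

```python
import math


def composite_validity(r1, r2, r12):
    """Validity of an equally weighted sum of two standardized marks.

    r1, r2: correlation of each separate examination with the criterion.
    r12:    correlation between the two examinations.
    """
    return (r1 + r2) / math.sqrt(2 + 2 * r12)


# Invented figures: each examination correlates 0.5 with the criterion.
for r12 in (1.0, 0.7, 0.5):
    print(r12, round(composite_validity(0.5, 0.5, r12), 3))
```

On these invented figures the combined validity rises above 0.5 as soon as the two examinations correlate less than perfectly with one another, which is the sense in which their different errors partly cancel.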



CHAPTER XIII 


CONSTRUCTION OF NEW-TYPE EXAMINATIONS 

The first stage in the construction of a new-type examination should 
be the preparation of a detailed statement of what the examination 
is intended to measure. A list of the topics which it is to cover should 
be written out, and a rough assessment made of their relative impor- 
tance, for example on a 1 to 5 scale. Suppose that the sum of all these 
assessments comes to 75, and that the examination is to last two 
hours. Probably about 150 questions will be needed. In that case 
each topic assessed as 1 in importance should eventually be covered 
by two questions, each assessed as 5 by ten questions, and so on. 
Naturally this distribution of questions need not be enforced mech- 
anically; but the principle of allotting most questions to the most 
significant parts of the field is important. If the test is designed simply 
to show the acquaintance of students with a particular textbook, the 
questions dealing with each topic may be proportioned to the number 
of pages on the topic in the textbook. 
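
The arithmetic of this allocation can be sketched as follows. This is only an illustration in Python under stated assumptions: the function name `allocate_questions` and the topic labels are hypothetical, and the handling of fractional shares (left-over questions go to the topics with the largest remainders) is one reasonable choice, not something the text prescribes.

```python
def allocate_questions(importance, total_questions):
    """Share out questions among topics in proportion to their importance ratings.

    importance: dict mapping topic name -> rating (e.g. on a 1 to 5 scale).
    Returns a dict mapping topic name -> number of questions.
    """
    total_rating = sum(importance.values())
    exact = {t: r * total_questions / total_rating for t, r in importance.items()}
    allocation = {t: int(share) for t, share in exact.items()}
    # Any questions lost by rounding down go to the largest fractional parts.
    leftover = total_questions - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - allocation[t], reverse=True)
    for topic in by_remainder[:leftover]:
        allocation[topic] += 1
    return allocation


# Hypothetical syllabus: 25 topics whose ratings sum to 75, as in the example above.
ratings = {}
for rating in (1, 2, 3, 4, 5):
    for k in range(5):
        ratings[f"topic rated {rating}, no. {k + 1}"] = rating

plan = allocate_questions(ratings, 150)
print(plan["topic rated 1, no. 1"], plan["topic rated 5, no. 1"])
```

With ratings summing to 75 and 150 questions, each unit of importance is worth two questions, so a topic rated 1 receives two questions and a topic rated 5 receives ten. As the text says, the resulting figures are a guide and need not be enforced mechanically.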

We have already stressed the point that questions should not deal 
with trivial details merely because these are the easiest to set. It is 
very unwise also for teachers to take sentences straight out of text- 
books and turn them into true-false or completion items. If this is 
done, the examinees will assuredly begin to study their textbooks 
with an eye to such sentences, and will try to learn them off by rote 
rather than to understand them. Textbook sentences may occasion- 
ally be adapted as wrong answers in multiple-choice questions, since 
they may then trap the parrot-learner. 

When the teacher and the examiner are one and the same person, 
suitable questions, or ideas for questions, are likely to occur to him 
while he is conducting the work of the class. These should be noted 
down and filed; they will save him much time and effort when the 
examination has to be made up. No examiner should expect to be 
able to devise a complete test at one sitting, unless it is intended to 
last only about a quarter of an hour. He will soon discover for him- 
self his speed of work, and will set aside sufficient hours, on a number 
of different days, to enable him to carry out the construction 
efficiently. The time needed will, of course, be much reduced if he 
can take over some of the questions from previous papers which he has 
set in the same subject. This is often possible since the papers are 
usually collected at the end of the examination, and not allowed to 
circulate freely round the school or university. Before he enters upon 
the final stage of construction, he should have in hand roughly double 
the number of questions he is likely to need. Many will have to be 
rejected for one reason or another, hence he requires plenty of spare 
ones. 

To reduce the subjective element in setting, it is desirable for 
several examiners to co-operate in supplying questions. In an institu- 
tion or department where examinations are frequently set to large 
numbers, it is useful to build up a pool of questions from which a 
number of new-type papers can be drawn as the need arises. Each 
question is recorded on a card in a classified file, with details as to 
the examinations in which it has been used, and any data regarding 
its difficulty and validity. Periodically, questions that have been used 
too often are withdrawn, and fresh ones added to keep up with any 
alterations in the course of instruction. 

The choice of questions for any one paper is governed by quite 
a different conception of difficulty from that which prevails in tra- 
ditional examinations. In accordance with the principles given in 
Chap. 11, the test should be so arranged that the poorest examinees 
will obtain marks very little above 0%, the best ones very little 
below 100%, and so that the average will be as near as possible to 
50%. (These figures may then later be converted into some more 
conventional scale.) A few questions, therefore, must be so easy 
that almost all (say 95%) can manage them, a few so difficult 
that scarcely any (say 5%) can manage them. Moreover, all inter- 
mediate grades of difficulty must be adequately represented. By 
contrast, all the questions in an ordinary examination are usually 
arranged to be roughly equivalent in difficulty. It follows that 
in a new-type examination, several sections of the work may be 
omitted altogether if they are likely to be known so well that more 
than 95% of the students would answer questions on them correctly. 
It is a pure waste of time to include items which everyone can do. 
Conversely, sections of the work may have to be omitted if even the 
best 5% are unlikely to be able to answer questions dealing with 
them; such questions would also be useless. 

Thus the examiner’s next step must be to grade each of his tentative 
questions for difficulty, i.e. to assess what proportions of examinees 
are likely to get each one right. The grading may be quite coarse, e.g. 
A = 5 to 20%, B = 21 to 40%, C = 41 to 60%, D = 61 to 80%, 
E = 81 to 95%. These estimates are likely to be highly inaccurate at 
first, and to diverge widely from similar estimates made by other 
examiners. But if he takes the trouble to check his predictions after 
each paper, and notes how many examinees did actually pass each 
item, his capacity will soon improve. The final version of the paper 
should contain approximately equal numbers of A, B, C, D, and E 
items.¹ 
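
The checking of predictions described above lends itself to a simple routine. The following is a minimal sketch in Python, assuming the five coarse grades quoted in the text; the names `difficulty_grade` and `check_estimates`, and the sample figures, are hypothetical.

```python
from collections import Counter

# Coarse grades quoted in the text: percentage of examinees answering correctly.
BANDS = [(5, 20, "A"), (21, 40, "B"), (41, 60, "C"), (61, 80, "D"), (81, 95, "E")]


def difficulty_grade(percent_correct):
    """Return the grade A to E for an observed or estimated percentage correct.

    Returns None when the item falls outside 5 to 95%, i.e. when it is so hard
    or so easy that it is of little use for spreading out the marks.
    """
    p = round(percent_correct)
    for low, high, grade in BANDS:
        if low <= p <= high:
            return grade
    return None


def check_estimates(estimated, observed_percent):
    """List the items whose observed grade differs from the examiner's estimate."""
    mismatches = {}
    for item, guess in estimated.items():
        actual = difficulty_grade(observed_percent[item])
        if actual != guess:
            mismatches[item] = (guess, actual)
    return mismatches


# Invented data: the examiner's estimated grades and the percentages actually passing.
estimated = {"Q1": "E", "Q2": "C", "Q3": "A"}
observed = {"Q1": 90, "Q2": 35, "Q3": 12}
print(check_estimates(estimated, observed))
print(Counter(difficulty_grade(p) for p in observed.values()))
```

The second line of output counts the items falling in each grade, which gives a quick check on whether the grades A to E are about equally represented in a paper.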

What particular form of new-type question to adopt depends en- 
tirely upon the subject-matter. Some things are more readily ex- 
pressed in one form, some in another. No certain evidence of the 
superiority or inferiority of any type is as yet available. Experiments 
on this matter have been carried out; but the skill of the authors of 
the items has not been controlled. Suppose, for example, that mul- 
tiple-choice items are found to be more reliable and valid than true- 
false ones, this may merely show that the experimenter himself is 
better at compiling multiple-choice items. Nor can we definitely 
claim that different types measure distinctive abilities. The correla- 
tions between two multiple-choice tests, or two true-false tests, do 
not appear to be higher than those between a multiple-choice and a 
true-false test. This indicates that they both measure the same thing. 

There may, however, still be some advantage in probing the mind 
by a variety of techniques. And several different types may be em- 
ployed in a long paper so as to reduce monotony. In a short paper 
this is undesirable, since each type needs to be prefaced by full in- 
structions, and examinees may have to spend too large a proportion 
of time in reading the instructions, and in readjusting their ‘mental 
set.’ The following rule may be taken as a rough, though not a rigid, 
guide. If matching items are included, there should be at least five of 
them, each including 5 to 10 responses. If multiple-choice items are 
included, there should be at least ten of them. If true-false, simple 
recall, or completion items are included, there should be at least 
twenty of each. The numbers of other types of questions should be 
comparable to these figures. 

¹ If, however, the object of the examination is to select or cut off, say, the best 
10%, or the poorest 30%, then maximum efficiency will be obtained by aiming 
as many questions as possible at these respective levels of difficulty, and not 
bothering about their suitability for spreading out the marks of the majority 
(cf. p. 31). 

When the questions have all been chosen they should be rearranged 
so that all those of any one type are grouped together. In each 
group the questions should be arranged in estimated order of 
difficulty, starting with the easiest. If this precaution is omitted, the 
poor and medium examinees may waste a lot of time attempting 
difficult items at the start, and fail to reach easier ones which they 
could do, and the reliability of their results will be reduced. 
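
The rearrangement just described amounts to grouping by type and then sorting each group from easiest to hardest. A minimal sketch in Python follows; the record layout (keys `type`, `est_percent_correct`, `text`) and the function name `arrange_paper` are assumptions made for illustration, and the order of the groups is simply the order in which each type first appears.

```python
def arrange_paper(questions):
    """Group questions by type and put the easiest first within each group.

    questions: list of dicts with keys "type", "est_percent_correct" and "text".
    A higher estimated percentage correct means an easier question.
    """
    type_order = []
    for q in questions:
        if q["type"] not in type_order:
            type_order.append(q["type"])
    arranged = []
    for question_type in type_order:
        group = [q for q in questions if q["type"] == question_type]
        group.sort(key=lambda q: q["est_percent_correct"], reverse=True)
        arranged.extend(group)
    return arranged


# Invented draft paper.
draft = [
    {"type": "true-false", "est_percent_correct": 40, "text": "item a"},
    {"type": "multiple-choice", "est_percent_correct": 70, "text": "item b"},
    {"type": "true-false", "est_percent_correct": 90, "text": "item c"},
    {"type": "multiple-choice", "est_percent_correct": 85, "text": "item d"},
]
for q in arrange_paper(draft):
    print(q["type"], q["est_percent_correct"], q["text"])
```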

We will now consider each of the main types. 

SIMPLE RECALL AND OPEN-COMPLETION TYPE 

A question or short problem is followed by a blank space where 
the answer is to be written. Alternatively certain words or short 
phrases are omitted from a sentence or paragraph, leaving blank 
spaces to be filled in. For example:¹ 

1a. Who was the inventor of the steam engine? ........ 

1b. The inventor of the steam engine was ........ 

2. If $y = 2x^5 + \frac{1}{x} + 3$, what is $\frac{dy}{dx}$? ........ 

3. Ohm’s Law states that ........ × Resistance = ........ 

4. When a salt is dissolved in water the molecules break up 
into ........ A positive electric charge is carried by the ........, 
a negative charge by the ........ A balanced action takes place 
whereby XY ........ The ........ theory provides an explana- 
tion of the ........ of electricity by salt solutions. 

5. A geography test may consist of a map on which numbers, 
but no names, are printed. Below is a column of these numbers and 
the examinee is instructed to write opposite each the town or river 
which it represents. The same technique can be used for testing 
knowledge of the names of parts of some instrument, mechanism or 
biological specimen. 

Each correct answer receives one mark; thus 6 marks may be 
scored on Question 4. 

The advantages of this type of item are, first, its naturalness and 
its similarity to ordinary questioning; secondly, its freedom from the 
chance guessing effects which occur in closed or recognition items; 
and thirdly, the facility with which it may be constructed. This 
facility, however, is somewhat delusive, and unsuspected flaws are 
very likely to creep in. The chief disadvantage of the type is that its 
scope is practically limited to rote knowledge, or skill at simple 
arithmetical and scientific problems. 

¹ The author feels that an apology may be needed for the puerility of some of 
the illustrative questions in this chapter. They are intended to show the applica- 
tion of the technique to various school and university subjects, at varying levels 
of difficulty. Preferring to construct fresh ones rather than to quote from pub- 
lished tests, he is limited by the rustiness of his own education. 

A number of hints on construction may be listed as follows. 

(a) Try to ask only for important items of knowledge. 

(b) Ensure as far as possible that there is only one right answer, 
and therefore confine the answers to single words or numbers, or to 
very short phrases. Some of the blanks in Question 4 violate this 
rule. The last one might be filled by ‘conduction,’ ‘carriage,’ or 
‘carrying.’ If there is any likelihood of alternatives, the examiner 
should list them all beforehand, and decide which ones to allow as 
correct. In this instance he would have to decide whether to disallow 
‘passage.’ Examiners may find it easier to conform to both these rules 
if they think of the answer first, and then build up the question or 
sentence around it, in such a way that the answer is completely 
delimited. 

(c) Keep the questions and sentences as short and as unambiguous 
as possible, and in particular avoid over-mutilation of a completion 
sentence by inserting too many blanks. A completion sentence should 
always be looked on as a form of question. See therefore that sufficient 
words are left to make it an intelligible question. For the same 
reason, do not put blanks near the beginning of a sentence, which 
can only be answered by consulting later clauses of the sentence, or 
later sentences in the paragraph. Most of the blanks should be at the 
end of the sentences. 

(d) Study the whole content of a sentence or question to see that 
all of it is essential for the production of a correct response. In 
badly constructed items, it may be possible to deduce the response 
simply by the exercise of intelligence, without any knowledge of the 
subject-matter. In others the correct response may be given away 
somewhere in a later sentence; or hints may be derived from the 
grammatical construction. For example, blank spaces should not be 
preceded by ‘a’ or ‘an.’ An irrelevant clue is provided by the ‘an’ in 
the following example: 

6. A piece of land entirely surrounded by water is called an ........ 

This difficulty could be overcome by wording the sentence as a 
question, or by printing ‘a(n) ........’ 

(e) Do not vary the length of the blank space according to the 
size of the answer expected, as this would also supply irrelevant 
clues. See that the spaces are big enough for examinees with large 
writing! 

(f) For purposes of easy marking, it is desirable to arrange for 
answers to be written in the right-hand margin. It is much more 
difficult for the eye to pick out answers scattered all over the page. 
Even the following is rather inconvenient: 

Write down the author of each of these plays. 

7. The Admirable Crichton ........ 

8. Candida ........ 

One solution is to number the blank spaces and to print a separate 
column of numbers opposite which the answers are written, thus: 

Write down the author of each of these plays. 

7. The Admirable Crichton                    7 ........ 

8. Candida                                   8 ........ 

A paragraph-completion test, with scattered answers, may be scored 
more efficiently by constructing a stencil of cardboard, with holes 
through which only the answers are visible when the stencil is laid 
over the test paper. Under each hole on the stencil is written the 
correct answer. 


TRUE-FALSE TYPE 

This generally consists of a set of statements, approximately half 
of which are true, the rest false. The examinee has to indicate which 
is which. In a spelling test the ‘statements’ may be reduced to single 
words, whose correctness or incorrectness is to be checked. For 
example: 

Write a + after each of the following words which is correctly 
spelled, and a − after each one wrongly spelled. 

9. Medecine 

10. Embarrassment 

11. Accelerate 

12. Sieze 

13. Advertisement 

A modification which is useful for tests of spelling, grammar, and 
literary knowledge is to instruct the examinee to cross out, or under- 
line, the one word in each of a set of sentences which is ungram- 
matical, or wrongly spelled, or wrongly quoted. One example of 
each is given.¹ 

¹ These should perhaps be classed as multiple-choice items, since the examinee 
has to pick out the incorrect word from about half a dozen correct ones. 

14. I met him and she in the town. 

15. The ruffian demolished all the aparatus in the laboratory. 

16. Rule, Britannia! Britannia rules the waves. 

The ordinary true-false type has been commonly used in tests of 
all subjects, because it is capable of sampling very quickly a wide 
range of knowledge, and because examiners find it easy to construct. 
But the uncertainties due to guessing (cf. pp. 220-2) and, still more, 
its liability to unsuspected errors, have recently made it less popular. 
Some recommend that it should only be used when the multiple- 
choice type is ruled out for lack of plausible alternatives. Thus the 
following are justified since each involves only two possible answers: 

17. Distance east or west of Greenwich is known as longitude.   True   False 

18. When summer time starts the clocks are put back one hour.   True   False 

There is no need to confine true-false items to rote information. 
Indeed, questions of general principle, of interpretation and the 
like can well be expressed in this form. Great care is needed, how- 
ever, to see that they do not violate some of the following principles 
of construction. 

(a) Statements should be simply worded, otherwise they are 
liable to contain sections which are true and sections which are false. 
In general it is best to use sentences which contain only a single 
clause, embodying only a single idea; to word them positively; and 
to refrain both from negatives and from conjunctions such as 
‘because’ and ‘therefore.’ The following examples are ambiguous, 
and therefore unsatisfactory. 

19. Napoleon lost the battle of Waterloo because his army had 
insufficient ammunition.   True   False 

20. Beethoven was a great composer who wrote many symphonies, 
operas, and sonatas.   True   False 

The most crucial part of a statement, which renders it true or false, 
should usually come near the end. 

(b) It is not easy to construct difficult statements, nor to estimate 
their degree of difficulty beforehand. They need to be sufficiently 
plausible to deceive the examinee whose knowledge is incomplete, 
and therefore not too obviously true or false. And the true ones 
should be just as likely to appear false as the false ones true. At the 
same time they must be clearly true or false to the able examinee, and 
therefore must not contain ‘tricks.’ Good students look out for 
essentials, and so may be tripped up if errors are introduced into 
minor details. The following example is bad for this reason. 

21. Water boils at 212° and freezes at 32° C. True False 

(c) It has been found that certain forms of wording are much 
more apt to be used in true than in false statements, or vice versa. 
Examinees soon get to recognize these, and so may answer correctly 
without any knowledge at all of the subject-matter. The great majority 
of statements, in amateur new-type tests, which contain the words: 
‘All,’ ‘Always,’ ‘Never,’ ‘Impossible,’ are false. The majority of those 
which contain ‘Usually,’ ‘Often,’ and the like, are true. Most very 
long statements also tend to be true more often than false. There is 
no need to discard the above words altogether, but they should be 
used with caution and, if possible, introduced as frequently into 
true as into false statements. 

(d) If the number of statements is less than, say, twenty, there 
should not be an exactly equal division between true and false, else 
examinees will count up to make sure that they have checked them 
half and half. 

(e) The order in which the statements appear must, of course, be 
entirely random. If the examiner never prints more than two true or 
two false statements consecutively, examinees may again discover 
this and take advantage of it. One way of arranging is to toss a coin. 
If it shows heads, take the easiest of the true items as No. 1. If the 
next four tosses are tails, take the four easiest false statements as 
Nos. 2 to 5; and so on. 

(f) Various methods for recording the responses have been used: 
underlining the words True or False, or Yes or No, drawing a circle 
round the letters T or F, or writing the letters T or F. One of the 
simplest and most satisfactory is to write a + or a − in a blank 
space at the end of each statement. These spaces should all be 
aligned in a straight column in the right- or left-hand margin. 

(g) The correction for guessing, described on p. 221, should be 
applied, and examinees should be warned of this in the general 
instructions at the beginning of the test. 
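
Two of the hints above, the coin-tossing arrangement in (e) and the guessing correction in (g), can be sketched in a few lines of Python. Both functions are illustrations under stated assumptions rather than the author's own procedure. In `coin_toss_order` (a hypothetical name) the pseudo-random generator stands in for the coin, and once one kind of statement is used up the remainder of the other kind is simply appended, a case the text does not discuss. In `corrected_score` (also hypothetical) the usual 'rights minus wrongs' formula, R − W/(n − 1), is assumed; the account on p. 221 is not reproduced here, so this may differ in detail from it.

```python
import random


def coin_toss_order(true_items, false_items, rng=None):
    """Interleave true and false statements by repeated coin tosses.

    Both lists should already be in order of difficulty, easiest first.
    Heads takes the next easiest true statement, tails the next easiest false one.
    """
    rng = rng or random.Random()
    true_left, false_left = list(true_items), list(false_items)
    paper = []
    while true_left or false_left:
        heads = rng.random() < 0.5
        if (heads and true_left) or not false_left:
            paper.append(true_left.pop(0))
        else:
            paper.append(false_left.pop(0))
    return paper


def corrected_score(right, wrong, alternatives=2):
    """Score corrected for guessing: R - W / (n - 1).

    For true-false items (n = 2) this reduces to rights minus wrongs.
    Omitted items count neither as right nor as wrong.
    """
    return right - wrong / (alternatives - 1)


# Invented example: three true and three false statements, easiest first.
print(coin_toss_order(["T1", "T2", "T3"], ["F1", "F2", "F3"], random.Random(1)))
print(corrected_score(right=30, wrong=10))                 # true-false test
print(corrected_score(right=40, wrong=8, alternatives=5))  # five-response items
```

Passing a seeded `random.Random` makes an arrangement reproducible, which may be convenient if the same paper has to be regenerated later.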

MULTIPLE-CHOICE TYPE, INCLUDING BEST REASON AND 
MATCHING ITEMS 

A large number of variants of this type may be distinguished, 
examples of most of which are quoted below. 

(i) The number of alternative responses is usually five, but it may 
range from seven to three, or even two, according to the number of 
possibilities implicit in the question. Six- or seven-response questions 
occupy rather a lot of space. Three- or two-response questions should 
usually be avoided because they require the application of guessing 
corrections. In any one new-type test it is convenient, though not 
really necessary, to provide the same numbers of responses for all 
the questions. 

(ii) When the responses consist of single words, they can be 
printed all on one line, consecutively. They should be enclosed in 
brackets, and preferably be printed in italics. The examinee is then 
instructed to underline the one word in each set of brackets which is 
correct. For example: 

22. They (was, were) going (to, too) London (too, to) visit (their, 
there) aunt, (who, what) lived (there, their). 


When, as in most of the remaining examples, the responses consist 
of several words or phrases, underlining is troublesome both for the 
examinee and for the marker. The responses may sometimes be 
printed consecutively and numbered, and the examinee writes the 
number of his choice in a blank space. For example: 

23. What is the function of the sacral autonomic nervous system? 
(1) To increase circulation and respiration. (2) To control posture 
and movements of the limbs. (3) To increase the activity of the
reproductive organs. (4) To contract the sphincters of the excretory 
organs. (5) To inhibit the activity of the alimentary canal. 

....3.... 


This type of question is, however, inconvenient to read, hence it
is more usual, as in the examples below, to print the responses in a 
column. The examinee indicates his choice by a cross in one of the 
blank spaces in the margin. Unfortunately this method tends to 
consume a lot of space. 

(iii) The items may be stated in question or in completion form. 
Thus Question 23 might equally well commence: 

23. The function of the sacral autonomic nervous system is : 

( 1 ) . . . 

(iv) In the ‘best reason’ type, the examinee is instructed to check 
the best of a set of reasons for a given statement. Occasionally a 
more suitable item may consist in finding the worst reason. For 
example : 

24. Many investigations indicate that children’s intelligence is 
somewhat affected by environment and education. Which of the
following provides the poorest evidence for this conclusion? 

The correlation between the intelligence of orphans
and their parents is lower than that between
children and their parents who have reared them . . . . . . .

The average difference between the I.Q.s of pairs of
identical twins is higher among those reared apart
than among those reared together . . . . . . .

Children tested before placement in foster homes,
and again after several years, show a greater in-
crease in intelligence in the better than in the
poorer homes . . . . . . .

Children of professional parents are on the average
markedly superior in intelligence to children of
unskilled labourers . . . x . . .

(v) Two or three responses, instead of one, may be correct. 
Occasionally none of them may be correct. The instructions should 
make it absolutely clear to the examinees what is wanted. For 
example : 

25. Put a cross opposite two of the following English painters 
who are especially famous for their landscapes: 

Constable . . . x . . .          Reynolds . . . . . . .

Hogarth . . . . . . .            Romney . . . . . . .

Lawrence . . . . . . .           Turner . . . x . . .

26. Put a cross opposite one answer only. 

Sir Thomas Browne is best known as the author of 

Fiction . . . . . . .

Plays . . . . . . .

Poetry . . . . . . .

Works of travel . . . . . . .

None of these . . . x . . .

(vi) The same set of responses may apply to several questions. For 
example: 

Identify the following orchestral instruments by drawing a circle 
round S for strings, W for woodwind, B for brass, and P for per- 
cussion : 

27. Clarinet        S  (W)  B   P

28. Triangle        S   W   B  (P)

29. Double bass    (S)  W   B   P

30. English horn    S  (W)  B   P

31. French horn     S   W  (B)  P

(vii) Matching tests are an extension of the previous category. A 
number of questions and a number of responses are listed in different
order, and have to be fitted together. For example : 

32. On the left is a list of Treaties or Peaces, on the right a list
of historical developments which were associated with these Treaties.
After each of these developments write the letter of the Treaty to
which it corresponds. Note that some of these letters need not be
used at all, and that some may be used twice.

Peace or Treaty of:      Historical Development:

(a) Berlin               Napoleonic empire broken up . . . . . . . . . . g . . .
(b) Frankfort            League of Nations established . . . . . . . . . . . . .
(c) Paris                German annexation of Alsace-Lorraine . . . . . . b . . .
(d) Tilsit               France regains Alsace-Lorraine . . . . . . . . . . . . .
(e) Utrecht              Independence of Holland finally recognized . . . . . . .
(f) Versailles           Britain becomes supreme colonial power . . . . . c . . .
(g) Vienna               Western powers set the stage for Balkan wars . . a . . .
(h) Westphalia


Many matching tests have equal numbers of items in both columns, 
each item corresponding to one, and only one, in the other column. 
Since, however, this makes possible the discovery of responses by a 
process of elimination, the above plan, with unequal numbers, is 
superior. The numbers of items should usually be five to twelve in
each column; a greater number involves waste of time in searching.
When possible the items in at least one column should be in alpha- 
betical order, to facilitate the search. It may be noted that this type 
is more compact than ordinary multiple choice, and so saves con- 
siderable space. 

The multiple-choice type of question is the most generally ap- 
plicable in educational testing — to diagrams, maps, etc., as well as to 
verbal or numerical problems. Not only factual information, but also 
quite elaborate reasoning and interpretation can be measured by it. 
In the best-reason type, for example, the alternatives provided may
be graded in reasonableness; but only one answer should be com- 
pletely correct. Again in matching tests, the examinees can be re- 
quired to apply a series of principles to a series of problems. Very 
great care, however, is needed in the construction, and the following 
hints may be useful. 

(a) The multiple-choice type should not be used when simple-
recall questions would be adequate, i.e. when there is only one
definitely right answer to the question. For example, it would be a
waste of time to express arithmetical problems this way : 

33. |x^=i,l, li,2. 

Indeed, most mathematical problems are better in open than in 
closed form, since the latter may enable the examinee to work back- 
wards from each answer until he reaches the right question, which is 
a very different matter from working forwards from the question. 

(b) All the responses provided must be sufficiently plausible to be
selected by a fair proportion of the examinees. Otherwise there is no 
object in including them. The incorrect alternatives can often be 
made up out of misconceptions and errors which are likely to 
be prevalent among the examinees. Sometimes, also, they will be 
selected by many of the poorer examinees if they contain several
words or phrases identical with words or phrases in the question or 
introductory statement. Both correct and incorrect responses should, 
however, be homogeneous in their mode of expression, length, and 
other external characteristics. If the correct answer is always longer, 
or always shorter, than the incorrect ones, examinees may discover 
this and employ it to their advantage. 

(c) Each response should be studied for possible irrelevant clues. 
For example, each one must be grammatically consistent with the 
question. This is especially necessary in matching tests. Some testers 
throw together such heterogeneous items that the examinees may be 
able to match them without any real knowledge of the subject. The 
following is an extreme (hypothetical) instance of a bad item which,
although supposed to show knowledge of geographical terms, could 
be answered correctly by inference alone. 

34. (a) A geographical line along which atmospheric      Watershed
        pressures are equal
    (b) Scottish lakes or sea inlets                      Monsoons
    (c) Winds which blow at a certain season in the       Isobar
        Indian Ocean
    (d) A high piece of land from which rivers flow       Contour lines
        in two or more directions
    (e) Lines on a map showing heights above              Lochs
        sea-level

‘Contour lines’ must be (e), because ‘lines’ are mentioned in (e).
‘Monsoons’ and ‘lochs’ must be either (b) or (c), since these are the
only remaining plurals. It is unlikely that ‘monsoons’ would be
printed so nearly opposite its corresponding item; moreover ‘lochs’
sounds like ‘lakes,’ hence ‘lochs’ must be (b). (a) and (d) mention
‘geographical line’ and ‘rivers,’ hence ‘isobar’ and ‘watershed’ will
probably fit them.

The examiner should remember that a matching test is an ex- 
tended form of multiple-choice item, and should therefore see that 
for each item in one column there are at least three items, preferably
four or more items, in the other column which constitute possible 
answers. 

(d) Not infrequently a better item may be obtained by turning
round a partly formulated question into an answer, and making the 
correct answer into a question. For example: 

35. Emphasis by means of an understatement is called:

Meiosis . . . . . . .

Hyperbole . . . . . . .

Euphemism . . . . . . .

Metaphor . . . . . . .

This might be re-formulated as follows:

36. Meiosis denotes:

Emphasis by means of an understatement . . . x . . .

An exaggerated rhetorical expression . . . . . . .

Substitution of a pleasant for an offensive expression . . . . . . .

An imaginative simile . . . . . . .

Probably the second form (No. 36) requires a fuller understanding 
of the subject than does the first (No. 35). A third, perhaps still
better form of the question, which would demand the application and 
not merely the possession of knowledge, would be a set of phrases 
from among which examinees would select the one showing meiosis. 
In general the examiner should study each item in an attempt to 
anticipate all the mental processes by means of which the correct
answer may be reached. If any of these processes are not representa-
tive of the ability that he wishes to measure, then he should modify
the item. 

(e) All statements and answers should be as succinct as possible, 
consistent with accuracy. Very verbose or complicated questions 
may depend more on the candidate’s reading ability and intelligence 
than on his knowledge of the subject. (Possibly Qu. 24 offends in 
this respect.) 





(f) It should hardly be necessary to point out that the position
of the correct answer in the series must be chosen entirely at random. 
First and last places should be employed as often as any of the 
intermediate places. 


REARRANGEMENT TYPE 

A series of items which normally fall into a certain specific order 
is listed in disarranged order, and the examinee has to rearrange 
them. For example:

37. Below is a list of waves or rays. Opposite No. 1 in the right-
hand margin, write the letter (a, or b, or c, etc.) of the wave or ray
with the longest wave-length. Opposite No. 2 write the letter of the
wave or ray with the second longest wave-length, and so on down
to No. 7.

(a) Rays from a sodium vapour lamp                     1. . . . f . . .
(b) Waves from a black kettle containing hot water     2. . . . d . . .
(c) Gamma rays                                         3. . . . b . . .
(d) Television transmission waves                      4. . . . a . . .
(e) Rays from the green of the spectrum                5. . . . e . . .
(f) Wireless telephone waves                           6. . . . g . . .
(g) X-rays                                             7. . . . c . . .

The mental processes involved in answering this question will
naturally depend on what has been taught. If the pupils have been
provided with such a list, they may answer by rote recall ; if not, then 
they may have to perform an elaborate process of analysis and re- 
synthesis of their knowledge of various branches of physics. 

By disarranging the order of phrases in sentences, Ballard (1923) 
provides a test which, he claims, measures knowledge of sentence 
structure and of the canons of English style. The time sequence of 
historical events, the sizes of geographical areas, the order of opera- 
tions in some chemical or physical experiment, grammatical se- 
quences in foreign languages, and the like, can readily be expressed 
in this type. It has not, however, been very widely used, probably 
because of the difficulty of devising a simple and just method of 
scoring (cf. Sims, 1937). Ballard’s (1923) plan seems fairly satisfactory. 
Award one mark if the first item is correctly identified; thereafter
award one mark for each item which correctly follows the one
preceding it. For instance, if an examinee answers Question 37:

(f) (d) (a) (e) (g) (b) (c) instead of (f) (d) (b) (a) (e) (g) (c),

(f) is correctly placed first and scores one; (d) correctly follows (f)
and scores one; (a) should not follow (d), but (e) and (g) are in
correct sequence, and so score two; (c) does not score, although it is
correctly placed last, because it should not follow (b). The total
score here is thus 4 marks.

¹ The reader may wonder why the instructions to Question 37 are not
simplified by requiring examinees to write 1 after the item that should
come first, 2 after the next, and so on. The reason is that the scoring
of such a response would be exceedingly time-consuming, unless the
whole of it was correct.
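
The scoring plan attributed to Ballard above is mechanical enough to be sketched in a few lines of Python. The function name `ballard_score` and the representation of answers as lists of letters are assumptions made for illustration; the rule itself (one mark for a correct first item, one for each item that directly follows its proper predecessor) is the one stated in the text.

```python
def ballard_score(answer, key):
    """Score a rearrangement item by the plan described in the text.

    One mark if the first item is right, then one mark for every item
    that immediately follows the same item it follows in the key.
    """
    # Map each item in the key to the item that should come just before it.
    predecessor = {item: prev for prev, item in zip(key, key[1:])}
    score = 0
    if answer and key and answer[0] == key[0]:
        score += 1
    for prev, item in zip(answer, answer[1:]):
        if predecessor.get(item) == prev:
            score += 1
    return score

key = ["f", "d", "b", "a", "e", "g", "c"]
answer = ["f", "d", "a", "e", "g", "b", "c"]
print(ballard_score(answer, key))  # 4, as in the worked example
```

A wholly correct response scores as many marks as there are items, so Question 37 carries a maximum of 7.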

Certain other minor types of question have been described by 
Ruch (1929), Rinsland (1938), and others. But these all appear to be 
variants of the simple-recall or completion, true-false, or multiple- 
choice types. 


PRACTICAL EXAMINATIONS 

Probably one of the most unreliable of conventional examinations 
is the practical in physics, chemistry, and biology and the engineering 
trade test; because a 3-, or even a 6-, hour period can provide only 
such a limited sampling of the skill or knowledge of the examinees. 
Though such examinations cannot be converted to wholly objective, 
multiple-choice form, it was found by American institutions for 
technical training of recruits during the war that great improvements 
were possible along the same lines. The operations at which examin-
ees are supposed to be competent are analysed into a series of steps,
and some 30 to 50 typical problems are selected, each of which can 
be done in a few minutes. For example, in electricity, a certain circuit 
is nearly completed, and the examinee has to make one essential 
connection which will show his grasp of the circuit without his hav- 
ing to do the whole wiring. In chemical analysis, he may be told that 
such and such reagents have given such and such results ; what should 
he do next? Several questions in an engineering examination may be 
based on capacity to choose the right tools for a given job. In others, 
parts of machines may be laid out and the examinee required to 
indicate how they should be repaired, and so on. Each of the fifty- 
odd questions needs to be supervised by a separate examiner; 
senior students can often be employed for this purpose. The ex- 
aminees move round from one problem to another in turn, as many 
being examined simultaneously as there are problems. Each problem 
is usually marked as right or wrong, the right response being care- 
fully defined beforehand; though occasionally half-marks may be 
given if partial competence can be adequately defined. 


FINAL STAGES IN THE CONSTRUCTION OF A NEW-TYPE TEST

The examiner should rescrutinize every item, try to eliminate all 
ambiguities, and ensure that the whole of each item is going to 
function, i.e. that every word or phrase will be essential in arriving 
at a correct response, and that there is no possibility of obtaining 
correct responses through irrelevant clues (cf. Hawkes and Lindquist
et al., 1937). This can be done much more effectively if the questions
are submitted to other experienced examiners who may detect 
flaws which the author has not noticed. 

Next it is essential to make up comprehensive instructions for each 
type of question, which will be appropriate to the mental level of the 
dullest examinees. Unless the examinees are fairly sophisticated to 
new-type testing, sample questions already answered should be in- 
cluded. In classroom work it is a good plan to provide three samples, 
one already answered, to read this through with the class and point 
out why the given answer is correct, and then to ask the class to 
suggest answers to the other samples, helping any individual mem- 
bers who still do not understand what is wanted. 

Attention should then be directed to ease of scoring. If all the
answers can be aligned in a column in the right-hand margin, a key 
can readily be prepared consisting of a sheet for each page of the 
test, down the left-hand edge of which are written the correct answers
in the correct positions. 

It might be supposed that as some questions are harder than 
others, they should be allotted more marks. But actually many 
experiments show that weighting is hardly worth the trouble. The 
unweighted total scores correlate almost perfectly with weighted
totals. A very simple form of weighting such as the following may be 
used: award 1 mark to each correct simple-recall or completion 
response, also to each correct response in a matching item ; give 1 
mark to each true-false or two-response multiple-choice item and 
apply the R-W correction; give 1 mark to each three-response 
multiple-choice answer, but neglect the correction for guessing ; give 
2 marks to each multiple-choice item where four or more responses 
are provided. 
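
The simple weighting scheme just described can be written out as a small Python sketch. The item-type labels (`'recall'`, `'matching'`, `'two_choice'`, `'three_choice'`, `'multi_choice'`) and the function name are hypothetical conveniences, not terms from the text; the marks and the R-W correction follow the paragraph above.

```python
def weighted_total(results):
    """Total a paper under the simple weighting scheme in the text.

    results: list of (item_type, outcome) pairs.
    item_type: 'recall', 'matching', 'two_choice', 'three_choice',
               or 'multi_choice' (four or more responses).
    outcome:   'right', 'wrong', or 'omitted'.
    """
    total = 0
    for item_type, outcome in results:
        if outcome == "right":
            # Two marks only where four or more responses are offered.
            total += 2 if item_type == "multi_choice" else 1
        elif outcome == "wrong" and item_type == "two_choice":
            # R-W correction: each wrong answer cancels one right answer.
            total -= 1
    return total

paper = [
    ("recall", "right"),
    ("two_choice", "right"),
    ("two_choice", "wrong"),
    ("three_choice", "wrong"),
    ("multi_choice", "right"),
    ("matching", "omitted"),
]
print(weighted_total(paper))  # 3
```

Omitted items neither gain nor lose marks, which is what makes the R-W correction fair to the examinee who declines to guess.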

Although scoring is so rapid and fool-proof, slips do occur, 
particularly in adding up totals. Either then the marker should go 
through the papers twice, or else let the students or pupils re-mark 
their own papers and ask them to bring any errors to his notice.

Finally the examiner will certainly improve his future examining 
if he takes the trouble to study the difficulty and the diagnostic value 
of each item. A simple method is to sort the papers into four piles — 
the quarter with the highest totals, the quarters above and below the 
median, and the quarter with the lowest totals. The numbers of times 
each item is correctly answered by members of each group, and by 
the four groups combined, may be tabulated. The examiner may 
then compare the actual difficulties of the items with his estimates of 
difficulty made when setting the paper, and may note whether the 
proportions of passes for each item ascend regularly from the lowest 
to the highest group. If they fail to do so he should attempt to
discern the flaw in the item which may be responsible for its poor 
validity. 
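
The four-pile tabulation described above might be sketched as follows. The data layout (each paper as a dictionary from item number to right/wrong) and the function name `item_analysis` are assumptions for illustration, and the split into quarters is only approximate when the number of papers is not a multiple of four.

```python
def item_analysis(papers):
    """Tabulate passes per item for four groups ranked by total score.

    papers: list of dicts mapping item number -> True/False (correct?).
    Returns {item: (counts, total_correct, ascends)}, where counts runs
    from the lowest quarter to the highest and ascends says whether the
    proportion passing never falls from one group to the next.
    """
    ranked = sorted(papers, key=lambda p: sum(p.values()))
    n = len(ranked)
    # Cut points for four piles of (nearly) equal size, lowest totals first.
    bounds = [round(i * n / 4) for i in range(5)]
    quarters = [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]
    items = sorted({item for p in papers for item in p})
    table = {}
    for item in items:
        counts = [sum(bool(p.get(item)) for p in q) for q in quarters]
        props = [c / len(q) if q else 0.0 for c, q in zip(counts, quarters)]
        ascends = all(a <= b for a, b in zip(props, props[1:]))
        table[item] = (counts, sum(counts), ascends)
    return table

papers = [
    {1: False, 2: True, 3: False, 4: False},
    {1: False, 2: True, 3: True, 4: False},
    {1: True, 2: False, 3: True, 4: True},
    {1: True, 2: True, 3: True, 4: True},
]
for item, row in item_analysis(papers).items():
    print(item, row)
```

An item whose `ascends` flag is false (item 2 in this toy data) is the kind the examiner would re-examine for a flaw.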





CHAPTER XIV 


IMPROVEMENTS IN EXAMINING 

Examinations have far too great a burden to bear. They are sup- 
posed to indicate, as shown in Chap. XI, a large number of only 
partially related traits and abilities. If we could define more clearly 
what it is that we want to measure, and use for each purpose the 
most appropriate instrument, there is no doubt that we could effect 
a great improvement in the efficiency of these instruments. Let us 
consider first some of the substitutes for examinations which have 
been recommended by educationists, and the functions which they 
may serve. 

Oral Examinations, Viva Voces, and Interviews. — Oral examina-
tions are, of course, much older methods of educational measure-
ment than written examinations. Their main defects were pointed 
out when written examinations began to take their place, over one 
hundred years ago. It was recognized that the situation to which 
successive candidates are subjected cannot be the same for all, so 
that no sure basis exists for a comparison between different candi- 
dates. The examiner’s tone of voice, the wording of his questions, 
his encouragement or criticism, etc., may vary widely from one 
candidate to another. If, as in school inspections, all the candidates
are interviewed in the presence of one another, the examiner has to 
pose a series of different questions which are unlikely to be closely 
comparable in difficulty. In any viva voce the amount of time de- 
voted to each candidate is usually so short that reliable indications 
of his abilities can hardly be expected. There is the further defect that 
the situation may cause much greater emotional disturbance than 
does a written examination. The interviewer may wish to take certain 
qualities of temperament and character into account in his assess- 
ments, but the emotional qualities which are most prominent during 
the interview are probably not very typical of the candidate’s usual 
behaviour, nor of any great significance for educational purposes. 
When care is taken to put the candidate at ease, skilful questioning 
may, indeed, reveal something of his interests, his attitude to work, 
and cultural background; which are educationally relevant. But as 
often applied in 11+ selection, the secondary school Head’s inter-
view probably does little more than assess social class and nice
manners. Oral examinations in foreign languages will bring out the 
candidate’s goodness of pronunciation, but cannot be trusted to 
show his facility in conversation, because of the distorting effects of 
emotion. For the same reason, viva voces in other subjects cannot 
throw much light on the candidate’s capacity for formulating his 
knowledge orally. Theoretically they should constitute a valuable 
supplement to written examinations, since they provide a fresh 
medium of expression for ability. But investigations have shown that 
judgments based on interviews tend to be still less reliable than those
based on essay-examination answers. 

Hartog and Rhodes’s (1935) study of the Civil Service interview 
is particularly apposite. Sixteen candidates were interviewed in the 
ordinary way by two different boards, each composed of four or five 
experienced examiners. The members of each board consulted 
beforehand as to the questions they would ask (but the two boards 
did not inter-communicate), and had before them the candidates’ 
educational records. During the quarter- to half-hour interviews the 
members attempted to assess a candidate’s alertness, general in- 
telligence, and the suitability of his personality for the Civil Service. 
After discussion they drew up a final mark. It was found that mem- 
bers of any one board did agree fairly closely in their judgments, but 
that the two boards had reached widely differing conclusions. The 
two final marks differed by 12% for the average candidate, in one
case by as much as 31%, in another by as little as 1%. The correla-
tion between the two sets of marks was only +0.41 ± 0.14. Except
for the small number of the candidates, there seems to be no 
technical flaw in this experiment which would allow us to challenge 
its depressing results. 
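
The margin attached to the correlation above can be checked with a short calculation. It is assumed here, since the text does not say so, that the ± figure is the conventional probable error of a product-moment correlation, 0.6745(1 − r²)/√N; on that assumption sixteen candidates and r = 0.41 give a value consistent with the one quoted.

```python
import math

def probable_error_of_r(r, n):
    """Conventional probable error of a product-moment correlation.

    Assumes the usual large-sample formula 0.6745 * (1 - r^2) / sqrt(n).
    """
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

pe = probable_error_of_r(0.41, 16)
print(round(pe, 2))  # 0.14
```

With so few candidates the margin is about a third of the coefficient itself, which is why the small sample is mentioned as the one reservation about the experiment.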

There have been many other studies of the interview in the 
educational, vocational, and military fields, showing similar in- 
consistencies among interviewers (cf. Vernon and Parry, 1949). In 
view of the frequent use of the interview by university authorities for 
selecting students, Himmelweit and Summerfield’s (1951) study is
very relevant. They obtained almost zero correlations between pre- 
dictions made by university staff from interviews and subsequent 
degree performance, whereas an academic entrance examination gave 
small positive correlations and a battery of psychological tests 
distinctly better ones (cf. also Vernon, 1953). It would seem that 
oral questioning and the interview, though essential in education for
purposes of teaching, are useless for purposes of assessing ability or
the results of teaching. In the course of a school year a teacher can
undoubtedly build up a fairly comprehensive and reliable picture of 
each of his pupils’ abilities, which may be largely derived from oral 
work. But the conditions here are very different from those in a 
brief viva voce. The volume of questioning is far larger, the emotional 
atmosphere less strained, and the oral answers are supplemented by 
written work. 

Perhaps the best instance of the value of the interview is provided 
by the work of vocational psychologists. In advising a child on his 
career the psychologist applies several objective tests of aptitudes, 
and takes both medical and educational records into account. But 
his information as to the child’s personality and interests is derived 
chiefly from an interview. This is conducted informally so as to put 
the child at ease, and though it follows no stereotyped plan of ques-
tioning, it has, in fact, been designed very carefully so as to elicit
the child’s main traits. A further function of this interview is that it 
enables the psychologist to interpret the test results and other 
objective records, and to realize their significance in a synthetic view 
of the child as a whole. Guidance, which is based merely on the 
mechanical application of tests or examinations, is not successful. 
But there is ample proof that the guidance given by trained psycho-
logists who adopt these test and interview methods does possess a
high degree of predictive validity. Again in vocational selection for 
skilled trades, the psychologist can often use the interview success- 
fully for assessing the amount and the relevance of previous trade 
experience, though — when conditions permit — he also constructs 
and validates a battery of selection tests (cf. ftn. p. 145). For higher-
grade occupations such as managers, officers in the Services, and 
teachers, where tests seldom seem to be of any value, the technique 
of observing small groups of candidates (as developed by War 
Office and Civil Service Selection Boards), combined with a psycho- 
logist’s interview, is the most successful (Vernon, 1953). 

It would seem, then, that the interview, as at present employed in 
educational circles, is practically valueless, but that in skilled hands 
it may become a most useful tool. There is no reason other than 
expense and the scarcity of competent psychologists why the methods 
of vocational selection and guidance should not be applied in educa- 
tion, both for choosing pupils most likely to benefit from higher 
education, and for advising them as to the lines of study for which
they are best fitted. This is already common practice in many 
American and some Australian Universities. It is occasionally used 
here at the 11+ selection stage, though necessarily confined to a
small proportion of borderline candidates. 

Intelligence and Special Aptitude Tests. — Theoretically, intelligence 
tests should be able to fulfil one of the functions commonly assigned 
to examinations much better than examinations do at present, that 
is the selection of pupils or students with the greatest intellectual 
promise, as distinct from achievement. As we have seen, this is not 
because they really measure innate ability, but rather because they 
show the general level of thinking that the child has developed out- 
side school, which he should be able to apply in school. They should 
also be able to measure objectively the qualities which essay-type
examinations are supposed to elicit, and which new-type tests are 
supposed to be incapable of eliciting, such as ‘originality, power of 
thinking, grasp of general principles, etc.’ Experimental investiga- 
tions indicate that they do have some value for such purposes, but 
that they need to be employed with caution. They seem to be most 
useful among younger children. The Terman-Merrill, applied at 6 or
7 years, correlates very highly with educational achievement for the
next few years; coefficients of +0.80 or more may be expected. At
the 11+ selection stage, the group intelligence test usually gives
slightly better correlations with subsequent achievement than any
other single instrument. Within a selected grammar school popula-
tion, correlations would be expected to drop to about +0.40, hence
there will naturally be many discrepancies between I.Q. and bright-
ness as observed by the school teachers. But when these teachers
criticize tests they fail to realize how very much worse most of the
pupils with I.Q.s of 110 down to 70 would be. In the modern school,
correlations remain relatively high. 

At later ages, the agreement between general intelligence tests 
and academic work is usually lower still, because of the still greater 
restriction of range. In American universities, with their more 
heterogeneous students, coefficients still average around 0.4 to 0.5
(cf. Eysenck, 1947b). But here the figure is usually nearer 0.2 to 0.3
with degree results. Similar correlations are found among Training 
College students with their combined theory marks; with practical 
teaching marks there is little or no agreement. 
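
The effect of restriction of range on a correlation can be illustrated with the standard formula for direct selection on the test. This is a sketch under stated assumptions (linear regression and equal scatter about the regression line at all levels), and the spread ratio of 0.33 used below is purely illustrative, not a figure from the text; the function name is likewise hypothetical.

```python
import math

def restricted_correlation(r_full, sd_ratio):
    """Expected correlation after direct selection on the test.

    r_full:   correlation in the unselected population.
    sd_ratio: S.D. of test scores in the selected group divided by the
              S.D. in the unselected population (between 0 and 1).
    """
    return (r_full * sd_ratio) / math.sqrt(
        1 - r_full ** 2 + (r_full ** 2) * (sd_ratio ** 2)
    )

# A correlation of 0.80 in the full range, with the spread of scores
# cut to about a third, falls to roughly 0.40 in the selected group.
print(round(restricted_correlation(0.80, 0.33), 2))  # 0.4
```

The test has not become a worse instrument in the selected group; the same relation is simply being observed over a narrower spread of ability.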

More useful predictions at late adolescent and adult levels are 
likely to be reached through testing special aptitudes or group fac-
tors. Prognostic tests are intermediate between intelligence and
attainment tests. They try to show, not how much mathematics or 
English the pupils have previously acquired, but how well they can 
apply their intelligence plus acquirements to fresh mathematical or 
linguistic problems. Earle (1949) has had some success with English, 
science, and mathematics tests, and with his Duplex tests, in diag- 
nosing success at different types of secondary courses. Many Ameri- 
can colleges give ‘placement’ examinations to incoming freshmen, 
consisting of linguistic and numerical + non-verbal intelligence tests, 
together with English, mathematics, science, foreign language, and 
social studies tests, in order to determine the students’ main aptitudes. 
There is room for considerable development of such work in our own 
schools and universities (cf. Dale, 1954). 

Cumulative Record Cards. — The primary function of many exam-
inations is to provide a certificate of work done. Ostensibly this is
true of the General and Scottish Leaving Certificates, and of univer- 
sity degree examinations, though they are also often interpreted 
prognostically. Now, in most schools the teachers know quite well 
what work each pupil has done. Hence there is every reason for 
making more use of teachers’ records of this work and less use of 
external examinations. The former cover the whole ground, the latter 
never more than a fraction of it. The former are collected under 
everyday conditions, the latter when many pupils are in an abnormal 
emotional state due to cramming and fear of failure. The former are 
also less likely to produce the degrading effects upon teachers’ and 
pupils’ educational values which are characteristic of public com- 
petitive examinations. Teachers’ markings of school work are, of 
course, just as liable as examiners’ marks to subjectivity and un- 
reliability ; perhaps more so, because of the prejudices they are likely 
to possess regarding each pupil’s character and ability, which may 
influence their judgment of his written work. But if in the course of 
his career a pupil is taught by several different persons, each of whom 
records his achievements according to some uniform system, the 
cumulative record should yield a fairly reliable picture. Although, 
as stated above, the function of such records should be to certify 
work done rather than to predict future promise, yet the investiga- 
tion by the Scottish Council for Research in Education (1936) into 
the Leaving Certificate showed that teachers’ marks did in fact correlate
with subsequent university success just about as well as did the
external examiners’ marks. Likewise in McClelland’s (1942) and
several subsequent studies of secondary school selection, scaled
primary school teachers’ estimates have given better predictions of
grammar school work than have objective attainments tests. The 
many other uses to which cumulative records can be put, such as 
vocational and educational guidance, have been fully described by 
Hamley et al. (1937) and Walker (1955).

The objection which will at once be raised to such schemes is — how 
are the standards of different schools to be compared? McClelland 
(1942) demonstrated the enormous discrepancies between teachers’ 
assessments of their pupils and the levels actually obtained by the 
same pupils on objective tests, some teachers being far more generous 
than others. Indeed unless some system of scaling or standardization 
of assessments is adopted, the inclusion of such estimates in a selec- 
tion procedure is known to lower its validity. Several possible 
solutions to this difficulty have been tried. 

Valentine (1938) proposed that at 11+ one or more general intelligence
tests should be given in all schools under one authority.
If, say, one thousand grammar school places were available, and an 
I.Q. of 115 was found to cut off the top thousand pupils, then the 
number of pupils in each school scoring 115 and over constituted 
that school’s ‘quota.’ Meanwhile the head and staff of the school 
would have ranked their own pupils in order of suitability for 
grammar school education. This rank order and the I.Q.s would 
finally be combined with appropriate weights. Each school would 
send its allotted quota to the grammar school, but the precise pupils 
included in the quota would be those whom the primary teachers 
had chosen as most able, not merely those with the highest I.Q.s. 
While this quota scheme has, in fact, been used in certain urban areas, 
it is quite unsuitable for areas where there are numerous small 
schools, each submitting a dozen or fewer candidates. Even with big 
schools it is rather unreliable since, as shown above (p. 209), a percen- 
tage or proportion has a very large Standard Error. 
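The arithmetic of such a quota scheme may be sketched as follows. This is a minimal illustration only, not Valentine's own procedure: the function names and the figures are hypothetical, and the weighted combination of rank order and I.Q. is omitted, pupils being taken purely by the teachers' ranking within each quota. The last function uses the usual formula for the Standard Error of a proportion, sqrt(p(1 - p)/n), to suggest why quotas from small schools are untrustworthy.

```python
import math

def school_quotas(iqs_by_school, places):
    """Return (cut-off I.Q., quota per school) for a given number of places.

    iqs_by_school maps a school name to the list of its pupils' I.Q.s.
    The cut-off is the I.Q. of the pupil in position `places` when all
    pupils under the authority are ranked from highest to lowest.
    Ties at the cut-off may make the quotas add up to more than `places`.
    """
    all_iqs = sorted((iq for iqs in iqs_by_school.values() for iq in iqs),
                     reverse=True)
    cutoff = all_iqs[places - 1]
    quotas = {school: sum(1 for iq in iqs if iq >= cutoff)
              for school, iqs in iqs_by_school.items()}
    return cutoff, quotas

def select_pupils(teacher_ranking, quota):
    """Take the first `quota` pupils from the teachers' rank order (best first)."""
    return teacher_ranking[:quota]

def se_of_proportion(p, n):
    """Standard Error of a proportion p observed among n pupils."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical data: two small schools and three places in all.
iqs = {"A": [130, 120, 110, 100], "B": [118, 112, 95, 90]}
cutoff, quotas = school_quotas(iqs, 3)   # cutoff 118; quotas A: 2, B: 1
chosen_a = select_pupils(["Ann", "Bob", "Cy", "Dee"], quotas["A"])
print(cutoff, quotas, chosen_a)
print(se_of_proportion(0.25, 12))        # 0.125 for a school of twelve
```

With twelve candidates and a quarter of them qualifying, the Standard Error of the proportion is 0.125, that is, about one and a half pupils either way on a quota of three.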

An alternative plan, which is receiving experimental trials, is to 
fix each school’s quota initially at the same number as gained gram- 
mar school places in the previous year (or recent years). A few pupils 
who are ranked by their school as just above or below this borderline 
are examined, and the quota is adjusted up or down according to 
their results. This method is referred to by the Ministry of Education
as selection without any external examination, not even an
intelligence test. We have already pointed out that fairly considerable
variations in standards may occur, by chance, from year to year 
(p. 63). Thus it is possible that the method, though admirable in 
intention, may break down through the necessary quota adjustments 
becoming too numerous and unwieldy. 

Many American Universities employ what is, essentially, an elabo- 
ration of the same scheme. They record the final secondary school 
marks of students coming from various contributory schools for 
several years, together with the performance of these students at the 
university itself. Eventually it becomes possible to determine the 
relative standards of the secondary schools concerned. For example 
it may be found that students with 60% from School A are just as 
good as those with 80% from School B. Thus the numbers of students 
worthy of selection from each school can be deduced. 
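A rough sketch of how such relative standards might be estimated is given below. It assumes, purely for illustration, that each school's leniency can be summarized by the average difference between its pupils' university marks and their school marks; the function names and the records are hypothetical, and an actual scheme would probably use regression on several years' data.

```python
from statistics import mean

def school_offsets(records):
    """Mean (university mark - school mark) for each school.

    records is a list of (school, school_mark, university_mark) tuples
    collected over several years.
    """
    diffs = {}
    for school, school_mark, univ_mark in records:
        diffs.setdefault(school, []).append(univ_mark - school_mark)
    return {school: mean(d) for school, d in diffs.items()}

def adjusted_mark(school, school_mark, offsets):
    """Put a school mark on the common (university) footing."""
    return school_mark + offsets[school]

# Hypothetical records: School A marks severely, School B generously.
records = [("A", 60, 65), ("A", 70, 75), ("B", 80, 65), ("B", 90, 75)]
offsets = school_offsets(records)            # A: +5, B: -15
print(adjusted_mark("A", 60, offsets))       # 65
print(adjusted_mark("B", 80, offsets))       # 65
```

On these invented figures a mark of 60% from School A and one of 80% from School B come out as equivalent.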

Probably the best plan is that advocated by McIntosh et al. (1949), the
National Foundation for Educational Research and others, of scaling 
each school’s own marks or rankings against some uniform external 
set of attainments tests, by an adaptation of the techniques described 
above (pp. 54-8). Selection can then be based primarily on the teach- 
ers’ judgments, derived from internal school examinations and from 
cumulative records of junior school work. But each set of judgments
is calibrated or scaled up or down to accord with the distribution 
obtained by that school’s candidates on the external, objective tests. 
An agreed weighting can be given to the objective tests also. Such a 
scheme too requires fairly large numbers per school, though McIntosh
finds it workable even among schools submitting only half a
dozen candidates. It has the disadvantage that schools are still likely
to cram for the external tests, but it does take full account of teachers’
knowledge of their own pupils. A useful discussion of these and other
problems raised by the various selection procedures at 11+ is given
by Dempster (1954). 
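The principle of scaling teachers' marks against an external test may be illustrated by a short sketch. It assumes the simplest form of scaling, in which the teachers' marks are given the same mean and Standard Deviation as the school's scores on the external test; percentile methods could equally be used. The function name and the marks are hypothetical.

```python
from statistics import mean, pstdev

def scale_to_external(teacher_marks, external_scores):
    """Rescale one school's teacher marks to the mean and S.D. of the
    same school's external test scores. The teachers' rank order is kept;
    only the level and spread are taken from the external test."""
    t_mean, t_sd = mean(teacher_marks), pstdev(teacher_marks)
    e_mean, e_sd = mean(external_scores), pstdev(external_scores)
    if t_sd == 0:
        # All pupils given the same mark: nothing to spread out.
        return [e_mean for _ in teacher_marks]
    return [e_mean + (m - t_mean) * e_sd / t_sd for m in teacher_marks]

# A generous teacher (mean 80) whose pupils average only 50 on the test.
scaled = scale_to_external([70, 80, 90], [40, 50, 60])
print(scaled)
```

The pupil whom the teacher placed first remains first, whatever his own score on the external test; the test serves only to fix the standard of the school as a whole.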

Methods such as these require to be worked out and supervised by 
a statistically trained psychologist. As described here, they might not 
appear to be appropriate as substitutes for the General Certificate 
or other higher-level examinations. And yet the accrediting systems 
adopted by many Australian and New Zealand Universities show 
that something similar can be devised, which will largely banish the 
strain and unreliability of externally-conducted examinations among 
secondary school leavers, and yet provide a reasonable guarantee of 
scholastic achievement and of suitability for university courses.

IMPROVEMENTS IN ORDINARY EXAMINATIONS*

Since external examinations will for a long time play a large, though
we hope diminishing, part in education, and since even in an ideal 
system teachers must apply internal examinations, and mark their 
pupils’ work, we must consider carefully the improvements which 
might be made in the examination as it now is. From the investigations
we have described, and from the discussion in previous chapters,
the following principles may be deduced. 

1. The teacher or examiner should formulate as precisely as pos- 
sible what it is that he wants to measure. This implies first a clear 
conception of the objectives or aims to which, in his view, education 
should be directed. At the primary school level he may be content 
with the view that education consists in implanting certain tradi- 
tional knowledge in children’s minds, such as the way to work out 
sums and to parse sentences. This may be one objective in the secon- 
dary school and university also, though surely not one of the most 
important. History, science, foreign languages, and the like do involve 
the acquisition of a certain body of facts (dates, chemical formulae,
meanings of words, etc.); but they are supposed to possess many
additional values. Taste for literature, interest in foreign peoples,
scientific habits of mind, ability to apply knowledge to the interpretation
of present-day events, are but a few of the things which we
hope to inculcate. In making such an analysis, the teacher must be
careful to take into account the facts which psychologists have 
established in regard to transfer of training, namely that general 
faculties, like observation, memory, and reasoning, will not be deve- 
loped merely by the study of some particular school subject, and 
that indeed these faculties have little real existence. For example, a 
rational and scientific approach to contemporary problems can be 
developed by studying typical problems rationally, by conducting 
discussions and projects, by showing how irrationalities of thinking 
arise, and so on ; but not by the carrying out of stock experiments in 
physics, nor by memorizing the rules of grammar or logic. A Be- 
havioristic formulation should be helpful : instead of talking in terms 
of vague mental qualities, we should ask — what can the educated 
pupil or student do in each specific field of study, or in the world at
large, which the uneducated one cannot do ; and what are the precise 
elements of our teaching which are supposed to confer upon him 
these capacities?

Having formulated as candidly as possible our general philosophy
of education, we can ask which of these objectives are to be examined. 
We may decide that desirable moral conduct and social adjustment 
cannot be expressed in written work. But each objective should be 
considered from the point of view: what can the educated pupil
write, or what problems can he solve, which the uneducated one 
cannot. 

The external examiner, who has not been responsible for teaching 
the examinees, should adopt a similar procedure in deciding what his 
examination is meant to measure. He needs, however, to be especially 
careful to emphasize ‘can’ rather than ‘ought’ in his analysis. It is 
merely futile for him to complain that teachers are not doing their 
work properly, and to insist that examinees must conform to intellec- 
tual standards which may be current among persons like himself, but 
which are quite unattainable by children or immature students. His 
job is to set tasks which will differentiate the better pupils who exist 
in real, not ideal, schools, from poorer ones. If he is supposed to pick 
out, say, the best 30%, but formulates objectives which even the best 
5% can hardly achieve, then he is not doing his work efficiently and
is creating unnecessary difficulties for himself when the time for 
marking arrives. 

A group of persons, such as a head master and his staff, or a board 
of examiners, should be able to carry out such analyses much more 
logically and thoroughly than can any one person, however clear- 
headed he may be, since they can criticize one another’s ideas. 

Finally, the conclusions of this analysis should be communicated 
by the examiners to their prospective examinees, or by the teachers 
to their pupils. At the present time examining frequently degenerates 
into a battle between the examiners who try to catch out their victims,
and examinees who try to find out what their examiners really want, 
and what are their personal prejudices, by conning the papers they 
have set in the past. All this is a stupid waste of everybody’s time. 

2. Having decided upon the abilities that he wishes to measure, 
the teacher or examiner should consider the form of examination 
which will be most appropriate for each of them. French conversa- 
tion, qualitative chemical analysis, and other capacities obviously 
require special types. But in most instances the important point to be 
settled is whether the examination should be new-type, essay-type, 
or a mixture. Two pertinent considerations in this connection are the 
number of examinees, and the examiner’s own skill in constructing
new-type questions. If the numbers are very small, he may be justified
in shirking the extra expenditure of time required by objective exam- 
ining. And he should know from past experience that he is more 
successful in devising new-type measures of some things than of 
others. 

Generally speaking, any kind of factual information, or straight- 
forward skills, such as the carrying out of arithmetical operations or 
grammatical analyses, should be examined in new-type form, instead 
of being left to the vagaries of subjective marking. This need not 
mean that only trivial and unrelated items of knowledge or skill 
should be included. Good questions, as we have seen, can bring out 
the examinees’ ability to interpret the significance of facts, grasp of 
general principles, and application of such principles to specific new 
problems. True, there are many dangers in these more elaborate 
items, but they are hardly as serious as the dangers of trying to 
discover all the underlying abilities of the examinees from their 
answers to essay questions. 

We admit, however, that many important aspects of ability may 
be displayed more clearly in essay-type answers or (with subjects 
such as mathematics and physics) in the working out of long and 
complex problems. Literary style of composition, imaginative 
qualities and originality, evidence of the examinee’s interest in the 
subject and of his own thinking about it, are some of the things 
which the examiner may wish to elicit, even if he can never hope to 
measure them perfectly reliably. If the examination is designed 
primarily to bring out qualities such as these, then the examinees 
should be told so, and should be informed that amount and accuracy 
of information will not, in this instance, be taken into consideration. 
Frank directions about what is wanted are particularly desirable 
when the same examination includes some new-type and some 
essay-type questions. 

3. If, for one reason or another, essay-type questions have to be 
maids-of-all-work, then the examiner should ask himself which parts 
of the total field to be examined are the most important. Knowing 
that his questions can sample only a fraction of it, he should avoid 
the less essential parts, even at the risk of his questions being easily 
‘spotted’ beforehand. Overlapping among the questions is obviously 
undesirable also, since it is wasteful to cover any one part twice over. 
The unrepresentativeness of single papers, and the desirability of 
improving reliability by increasing their number should be borne in
mind. With each question, the length of time needed for a reasonable
answer should be considered, so that the examinees shall not be set 
too much, nor too little, to do in the period allowed. 
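The gain in reliability to be expected from setting more papers can be estimated with the Spearman-Brown prophecy formula, r_k = k r / (1 + (k - 1) r), where r is the reliability of a single paper and k the number of papers. The sketch below merely evaluates this formula; it assumes that the additional papers are comparable with the first, and the figures are illustrative, not drawn from any investigation.

```python
def spearman_brown(reliability, k):
    """Predicted reliability of k comparable papers combined,
    given the reliability of a single paper."""
    return k * reliability / (1 + (k - 1) * reliability)

# A single paper with reliability 0.5, then two and three such papers.
for k in (1, 2, 3):
    print(k, round(spearman_brown(0.5, k), 3))
```

On this assumption three papers, each with a reliability of only 0.5, would together reach 0.75.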

Much investigation has been made into the kinds of essay ques- 
tions which are least unreliable, particularly by the New York 
College Entrance Board (cf. Lindquist, 1951). It has been found that 
rather short discussion questions, each of which poses a definite 
problem and supplies specific indications of what is wanted, are the
best. Such questions still give plenty of scope for the examinee to 
reveal his powers of marshalling and organizing his material. The 
much more vague, yet very popular, type of question which allows 
greater freedom of response to the examinee only leads to confusion 
and difficulty in comparing one examinee with another. 

4. Perhaps the most important quality in a good examiner is his 
ability to foresee the sort of answers he will get, both through his 
insight into the minds of pupils or students, and through his experi- 
ence of their answers in the past. If in formulating each question he 
possesses a clear conception of the probable answers (good,
average, and bad), the following advantages are likely to accrue.
The questions will be simply worded, free from ambiguities, and 
they will not be interpreted in such divers ways that the answers are 
too heterogeneous to be compared, or marked consistently. The 
questions will also be appropriate in difficulty, not, that is to say, the
type of questions which no one but the examiner himself could 
answer adequately. Most important of all, the production of answers 
to these questions will really bring into play the abilities which the 
examination is aiming to measure. When, later on, the examiner 
reads and marks the answers, he should continually be trying to 
improve this ‘faculty,’ by noting how far his expectations are ful- 
filled. Every question may be regarded as an experiment in forecasting 
which, although never likely to be completely successful, may be- 
come progressively more so with practice. 

In this, as in all other branches of examining, two opinions are 
likely to be better than one, three than two. A board of four or five 
will perhaps be the most efficient. 

5. When essay-type questions call for answers of a largely factual 
nature, a detailed scheme of marking should be drawn up. This is 
especially necessary when several different examiners each mark a 
certain proportion of the papers. Experiments indicate that it will 
reduce their variability by roughly one-half. But a single examiner
will also greatly improve his consistency if he decides beforehand
just what points to look for, and marks these objectively.* 

In assessing the more general qualities which essay answers are 
supposed to bring out, the quality scale method (cf. p. 34) should be 
adopted. The examiner should always be on the alert to ensure that 
he is judging the scripts solely on the basis of the criteria which he 
has previously formulated, that he is not being biased by bad hand- 
writing or spelling, by some trick of style which jars on him, or by 
some personal predilection regarding the subject-matter. All marking 
which cannot be done objectively should consist largely in introspection:
“Why does this strike me as good or bad? Are these the
qualities which I am supposed to be judging, and are they the qualities
which I took into account when I selected my standard specimens?” 

Many questions may advantageously be marked both by the 
qualitative and by the objective (count) systems, the figures then 
being combined in due proportion. If two or more examiners mark 
the same, or even a few of the same, scripts, it will be particularly 
valuable to analyse the reasons for any discrepancies in marks, so as 
to clarify still further their criteria of good and bad. 

One other source of prejudice often arises in marking if the 
examiner is personally acquainted with some, or all, of the examinees. 
Either he should carefully avoid looking at the names on the papers, 
or, if he cannot help recognizing their writing or style, should do his 
best to disregard any previous impressions he may have formed 
about them. 

6. With the statistical aspects of marking we have already dealt
in Chaps. II-IV. But the following main points in subjective marking
(as distinct from count scoring) may be recalled. Marking is first
and foremost the arranging of scripts in order of merit, or grouping
them into categories of merit. It is not an attempt to give an absolute 
judgment of the value of each script. All the answers to one question 
should be classified into such categories before starting on another
question. The categories should be, approximately, evenly spaced,
and should not exceed ten in number. The frequency distribution of 
marks for each question should tend to the normal shape, and the 
average mark should be very close to 5 (assuming the middle cate- 
gory of merit to have been labelled 5). The dispersion of the distri- 
bution should also be approximately constant for all questions, 
unless it is desired to weight some more heavily than others. The 
final total scores for the scripts may be converted into an appropriate
distribution of percentage marks by any of the three methods given
in Chap. IV. All decisions as to standards of passing and failing should 
be postponed until the process of marking has been completed. 
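The requirement that every question should have much the same average and dispersion before the marks are added can be checked, and if need be enforced, quite mechanically. The following sketch is one possible way of doing so, not one of the three methods of Chap. IV: it rescales each question's marks to a mean of 5 and a Standard Deviation of 2 (the latter figure is an arbitrary assumption) and then adds them with optional weights. All names and marks are hypothetical.

```python
from statistics import mean, pstdev

def rescale(marks, target_mean=5.0, target_sd=2.0):
    """Give one question's marks the target mean and S.D."""
    m, sd = mean(marks), pstdev(marks)
    if sd == 0:
        return [target_mean] * len(marks)
    return [target_mean + (x - m) * target_sd / sd for x in marks]

def total_scores(marks_by_question, weights=None):
    """Add the rescaled marks of each script across questions.

    marks_by_question maps a question to its list of marks, one per
    script, the scripts being in the same order for every question.
    weights (optional) maps a question to its weight; default is 1.
    """
    weights = weights or {q: 1.0 for q in marks_by_question}
    rescaled = {q: rescale(v) for q, v in marks_by_question.items()}
    n = len(next(iter(rescaled.values())))
    return [sum(weights[q] * rescaled[q][i] for q in rescaled)
            for i in range(n)]

# Question 2 was marked over a much narrower range than Question 1.
marks = {"Q1": [3, 5, 7], "Q2": [4, 5, 6]}
print(total_scores(marks))
```

Without the rescaling, Question 1 would here count for twice as much as Question 2 in settling the final order, although no such weighting was intended.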

The final step in many examinations is to reduce the marks to two, 
three, or more categories (Pass or Fail ; First, Second, Third Class, 
etc.). Here it is most desirable that all scripts which fall within a 
few per cent, above or below any important borderline should be 
read a second time, preferably by more than one examiner, so as to 
ensure that the eventual decision shall attain the utmost reliability 
of which fallible human judgment is capable. 
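The picking out of borderline scripts for a second reading is easily made systematic. In the sketch below the margin of three marks is an assumption standing for "a few per cent," and the script labels, marks, and borderlines are invented for the purpose.

```python
def borderline_scripts(marks, borderlines, margin=3):
    """Return the scripts whose mark lies within `margin` of any borderline.

    marks maps a script label to its percentage mark;
    borderlines is a list of pass or class boundaries.
    """
    return sorted(script for script, mark in marks.items()
                  if any(abs(mark - b) <= margin for b in borderlines))

# Pass mark 40, First Class boundary 60 (both hypothetical).
marks = {"s1": 38, "s2": 41, "s3": 55, "s4": 59, "s5": 72}
print(borderline_scripts(marks, [40, 60]))   # ['s1', 's2', 's4']
```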


ANSWERS TO THE EXAMINATION ON PP. 213-19


1. 12.
2. 6.
3. 150.
4. 105 (104 or 106 may also count).
5. 80.
6. 21.
7. 32 (or 31).
8. 17 (or 15-16).
9. positive (or plus).
10. faculties.
11. Spearman.
12. g.
13. specific.
14. g.
15. group (or common).
16. k, k: m or F.
17. +.
18. +.
19. -.
20. +.
21. -.
22. -.
23. -.
24. +.
25. -.
(In Nos. 17-25, subtract one mark for each incorrect answer. No
guessing correction need be applied in subsequent questions.)
26. C.
27. C.
28. None.
29. A.
30. B.
31. 69 or 62.
32. -4 or -3.
33. C.
34. D.
35. B.
36. D.
37. B.
38. C.
39. A.
40. None.
41. D.
42. E.
43. D.
44. CF.
45. A.
46. CA.
47. EC.
(In Nos. 42-47 one point is scored by each answer in which one or
both letters are given correctly. Nothing is scored by any answer
which also includes an incorrect letter.)
48. D.
49. B.
50. G.
51. A.
52. E.
53. F.
54. C.
(For scoring of Nos. 48-54, see p. 242.)
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group, 131, 13&-45, 147, 151, 152, 
161, 169, 195, 226, 250 
multiple, 140-1, 143-4, 146, 147, 152 
specific, 137-8, 140-2,^144, 145, 150, 
151, 161 

Faculties, 20, 130-2, 138, 143, 145, 224, 
253 

Formal discipline, 199, 253 
Free word association tests, 39 
French, tests of, 178 
Frequency distributions, asymmetrical, 
6/1., 15, 17, 31, 104 
bimodair 15 

combining and comparing, 53-62, 
123-4 

curves, area of, 11, 16 
definition of, 4 
dissimilarities in, 54 
graphical representation of, 7-11, 
12-17 


Frequency distributions (cont.) 
irregular. 13-15, 18, 24, 29-30, 35, 
90-1, 104 

normal, see Normal distribution 
of classified measures, 5-7 
of correlations, 76 
of differences, 76-80 
of means, 17, 75, 82-3 
rectangular, 26, 54 
skew, 15, 17-18, 23 4, 29, 30, 31-2, 
39, 54, 57 

statistical methods, for handling, 
Ch. Ill, 54-8 
tabulation of, 5-7 
Frequency polygon, 7-11, 13-16 
Function fluctuation, 128-9, 1 37//,, 200, 
209-10, 227 

^factor, 142-6, 152, 161, 168, 169 
Gaussian distribution, see Normal 
distribution 

Goodness of fit, 50-1, 90-1 
Group intelligence tests, application 
of, 68, 162-3, 164, 167, 190-2, 
249-50 

batteries, 142, 156 
choice of, 153, 188--9 
classified list of, 153, 181-4 
compared with individual, 162-6, 
167-70 

construction of, 152-3, 155-6, 188-9 
differential, 169-70 
distributions of scores on, 10-16, 32, 
50-1, 81, 83-4, 90-1, 188-9 
effects of health and mood on, 151, 
162, 164-5, 169, 191 
effects of practice and coaching on, 
127, 157/1., 165-7, 191, 195 
scoring of, 156, 164, 189-90, 191-2 
time limits, 132, 156, 163-4, 190 
validity of, 151-3, 167 -9, 249 
See also Mental tests 

Handedness tests, 186 
Handwriting ability, distribution of, 17 
factorial analysis of, 1 36 
marking of, 25, 29, 34 
tests of, 39, 135, 155, 175 
Heterogeneity of populations, 122-3, 
128, 135, 139, 201, 204/i., 205, 209, 
249 

Histogram, 7-11, 13 
Honours students, superior ability of, 
16, 81-2, 209 

Intelligence, age and, 69-71, 94, 249 
definition and nature of, 19, 130, 142, 
146, 152, 156-7, 168-9 
economic status and, 68, 70, 74-5, 
121, 158 
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Intelligence (cont,) 
educational attainments and, 69, 92, 
108, 122, 125, 139, 144, 146, 153, 
156-8, 164, 168-9, 194, 210, 
249-50 

hereditary and environmental influ- 
ences on, 68, 156-8, 167, 249 
language handicaps and, 158 
nationality and, 68, 158 
physique and, 134 
sex and, 16, 76-80, 157/j. 
subjective judgments of, 139, 148-50, 
151-2, 196 

vocation and, 144- 5, 146, 157, 195 
Intelligence Quotients (I.Q.), distribu- 
tion of, 1 5-19, 49 50, 52, 72-3, 195 
interpretation of, 19, 22, 36, 71-3, 
193^6 

prediction of, 112 18 
stabilityof, 126-7, 156-7, 165, 168, 195 
standard deviation of, 52, 71-3, 

157/1., 195 

Intelligence tests, see Group, Non- 
verbal, Terman-Merrill, etc., tests 
Interests, factorial analysis of, 141, 147 
intellectual, 134, 139, 141, 144, 146, 
157, 246 
tests, 187 

Internal consistency, 152-3 
International Examinations enquiry, 
203, 205- 6 
Interviews, 246- 8 

k or k: m factor, 144, 161 

Language development tests, 174, 176 
Letter grades, 1 1 , 26- 8, 205 
Linguistic and literary abilities, see 
English, French, Reading, etc. 

Machine-scoring of tests, 189-90 
Manual abilities, factorial analysis of, 
132, 134, 136, 137, 142, 144 
tests of, 132, 135, 186 
Marking of scholastic work, analytic, 
28, 203^, 207, 21 1-12, 223, 256-7 
by ranking, 25-6, 27, 28, 30, 32, 58, 
59, 65, 103, 257 

improvements in, 31-5, 58, 59-62, 
256-8 

objective vs. subjective, 11 12, 2(1-1, 
28-30, 32-3, 35, 65-6, 1 49, 1 55, 202- 
3,206-7, 211-12, 226 8, 255, 257 
standards of, 27-8, 33-5, 53, 58, 
63-5, 86, 150, 200, 202-3, 205, 
251-2 

teachers’ idiosyncrasies in, 6, 27, 29, 
33, 62-4, 202, 257 
See also Ability, Arithmetic, English 
composition. Examinations 
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Marks and scores, combination of, 22, 
26, 28, 55 6, 59, 123-5,204 
comparisons of, 2, 32, 53-4, 56-8, 
59-60, 69 

distributions of, 4-7, 11, 24, 26, 
28-30, 32-5, 48, 53-8, 59 62, 86, 
257 

scaling of, 33, 35, 56-62, 124, 202, 
230, 251 2, 257-8 

significance of, 24 -5, 27-30, 33, 36, 
39-40, Ch. IV, 150, 210 
weighting of, 54- 5, 66, 123, 125, 244, 
257 

Mathematics tests, 177, 185, 250 
Mean, arbitrary or guessed, 42-5, 
46-8, 85, 95 

Mean, arithmetic, calculation of, 1, 6/z., 
38-9, 40-5, 48 

effect of, in combining marks, 54~6 
S.E. of, 82-3 

Mean Variation (M.V.), 45, 48 
Measures, definition of, 3 
Mechanical ability (//i-factor), 130-1, 
144, 145 
tests of, 184 
Median, 36-9, 45, 58 
Memory, 130, 132, 142, 143, 145, 168, 
193, 253 

Mental Age (M.A.), 23, 69, 70-1, 156, 
159, 160, 161, 162, 171-2, 188-9, 
194-5 

Mental deterioration, 145 -6, 170 
Mental measurement, 19-20, 194; see 
also Ability 

contrasted with physical, 19, 21-4, 
30, 194, 196, 206 

Mental organisation, 130, 132, 138, 
143, 146-7, 224 5 
Mental speed, 130, 132, 144, 146, 
163-4 

Mental testers, desirable qualities in, 
191-3 

training needed, 73, 153, 159, 160, 
162, 164, 190, 193, 196 
Mental tests, classified list of, 153-4, 
170-87 

construction of, 20, 23, 143, 148-53 
further developments of, 169-70, 250 
instructions, 148, 149, 159, 189 90 
inter-relations of, 92, 94, 129, 130-5 : 

see also Factorial analysis 
limitations of, 67-73, 126-9, 149, 
151, 152, 159 60, 161-70, Ch. X, 
249-50 

norms, 25, 26, 28, 40, 55, 58, 66-73. 
150, 160, 161, 167, 171-2, 189, 
190, 191-2, 195 
origins of, 148 
prognostic, 151, 250 
publishers of, 171 
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Mental tests (cw//.) 
reliability of, 2, 87, 121«., 126 9, 131, 
137, 150, 164, 165, 168 9, 188, 
194-5 

scoring of, 149-50, 159 60, 164, 
189-90, 191 2 

standardisation of, 25, 67-8, 69-70, 
148 50, 153, 159 60, 195 
validity of, 2, 87 8, 118-19, 123 -5, 
129, 150 3, 167 9, 188, 249-50 
Multiple factor analysis, 140-1, 143 4, 
147 

Musical abilities, factorial analysis of, 
142 

tests of, 1 86 

New-type examinations, advantages 
over essay-lype, 32, 211-13, 220, 
223 -^ 

applicability of, 20-1 , 35, 66, 219 20, 
228, 239, 243, 255 
best-reason questions, 237 8, 239 
completion questions, 214, 228, 231, 
232-4 244 

defects of, 151, 211, 219-28, 232 3, 

235- 6, 240-1 

distributions of marks on, 29-30, 

32 3, 212, 220 -1,227, 230-1 
elementaristic tendency in, 223- 5, 
226, 228, 235, 255 

guessing factor in, 220 2, 232, 235, 
236, 237, 244, 258 
irrelevant clues in, 225-6, 233 4, 
236, 240-1, 244 

matching questions, 218-1 9, 23 1 , 
238-42, 244 

multiple-choice questions, 215-17, 

220, 222, 225, 228,'231, 234/7., 235, 

236- 42, 243, 244 
practical, 243 

principles of construction, 32-3, 153, 
212-13, 220, 224 6, 227-8, Ch. 
XIll 

rearrangement questions, 219, 242- 3 
reliability of, 210, 221 2, 226 8 
scoring of, 220, 234, 242-3, 244 
simple recall questions, 212, 213 14, 

221, 226, 231, 232 4, 239-40, 244 
specimen paper, 213 9, 258 
suggestive effect of false statements 

in, 222 

true-false questions, 214 15, 220-2, 
228, 231, 234-6, 244 
validity of, 151, 225, 226, 228, 245 
weighting of marks, 244 
Non-verbal intelligence tests, classified 
list of, 153 181-2 
factor analysis of, 144 
uses of, 158, 163, 250 
See also Performance tests 


Normal curve of error, sec Normal 
distribution 

Normal distribution, algebraic equa- 
tion of, 49 

applicability to reliability, 76 80 
application to test construction, 23 -4, 
31, 65, 188 

applications to school marking, 
17-18,24,26,31-5,55,59 62,205, 
257 

definition of, 1 7 
disturbing factors in, 17-18 
examples of, 1 7 
ordinates of, 108 
testing conformity to, 50-1, 90 1 
Norms, see Mental tests 
Null hypothesis, 88, 89, 105 
Numerical desciiption ofdisliibulions, 
36 7, 39^0, 49, 56 
Numerical evaluation of abilities, see 
Ability 

Ogive graph, In. 

Omnibus tests, 156, 190 
One-tail and two-tail signiticance tests 
19n. 

Oral group tests, 163, 181 

Passmarks, see Examinations 
Pearson-Bravais method, see Correla- 
tion 

Percentage marking, defects of, 28-31 ; 
see also Ability, numerical evalua- 
tion of 

Percentages, S.E. of, 84, 86, 90 
Percentile graphs, 56 61 
Percentiles, 36 -7, 38 40, 49, 58, 156, 
171, 189, 194 

comparison of distributions by, 

56 61, 65 

sizes of units, 59, 67 
Performance tests, application of, 156, 
158, 160-2, 169, 180 1, 189, 193 
classified list of, 153, 180 I 
construction of, 148- 50 
factor analysis of, 144 
improvement on, with age, 70 
limitations of, 126, 161 
scoring of, 21, 149, 155, 161, 189 
Perseveration tests, 104 
Peisonality, study of, 141, 147, 161, 
169, 193 4, 246, 248; also 
Character, Temperament 
tests, 154, 186-7 

Physical abilities, distribution of, 1 7 
measurement of, 23 
tests of, 132, 186 

Political opinions, distribution of, 

88 9 

scaling of, 19 
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Practical ability, 134, 136, 144, 151, 
161, 16S, 193 
factor (/<), 144/7., 161 
tests of, 1 S7 

Practical examinations in science, 243 
Practical teaching ability, 136, 146, 
249 

Practice, effects of, on intelligence 
tests, 127, 157 a 7 ., 165 7, 191, 195 
Practice sheets, 156, 189, 191 
Probability integral tables and graph, 
49 52, 59, 60 2, 77, 79 
Probability theory, 512, 63 4, 77 80, 
109, 221 

Probable Eiror (P.E.), 78 80, S7a/., 
109, 204; .see also Reliability 
Product-moment method, see 
Coi relation 

Product scales, see Ou«ility scales 
Projective techniques, 187 
Pupils’ lecords, .see Cumulative iccord 
caids 


Quality scales, 34, 149/;., 154 5, 175, 
178, 254 

Quartiles (Qj and Qj), 36 8, 40, 54, 56, 
58, 79 

Quota scheme, 251 

Range of measures, 1 , 6, 36, 40, 45, 48, 
51, 55, 66; see also Dispcision 
Ranking of measures, 4, 103 4, 195; 
see also Mai king 

Rank-order method, see CX)ii elation 
Rapport in leslmg, 165, 191 3 
Rate and power plans, see Ahilit> 
Reading abilitv, factorial anal>sis of, 
134 6 

marking of, 29, 56 

tests of, 21, 22, -75, 135, 151, 153, 

173 5 

Reading Age, 69 

Reasoning abilities, 143, 146, 157 8, 
223 5, 253 
tests of, 179, 1 86 

Recognition type items, 144, 167, 169, 
212, 222, 225 6, 232, 234 43 
Rcgiession, 113 16, 123, 133 
equations, 115 16 
multiple, 125 
non-linear, 104 5 

Reliability, coelhcients of, 126, 129 
effects of function nuetuation, 128 9 
effects of hetciogencity, 128 
effects of length of test, 126, 127-8 
formulae for, 815, 109-10, 117, 

126 8 

of correlations, 74, 76, 81, 87, 

109 10 


Reliability (cont.) 

of differences, 2, 74 5, 76 82, 84 5, 
86 

of means, 74 5, 8 1 , 82 3, 84 
of percentages, 84, 86, 90 
of predicted scores, 87 8, li7 19, 
123, 126 7, 157 
of scoies, 87, 126 7, 194 
of small samples, 85 
of standard deviations, 83 4 
of 7, no 
theory of, 74 80 

See also Examinations, Mental tests 


Sample test items, 156, 189, 191, 244 
Sanipics of the population, unselected, 
17, 67 70, 74 5, 80, 122 3 
Sampling errois, 63, C’h V, 105, 
109 10, 126-7, 128 9, 137/i., 138, 
142, 195, 20a-l 

Sampling of ability, see Ability 
Scaling, see Marks and scores 
Scattergram, 92 4, 98, 101, 104, 105, 
112 4, 117, 133 4 

Scatter of measures, see Dispersion 
Scholastic abilities, factorial analysis 
of, 92, 134 6, 142, 144 
tests of, see Educational, Ai ilhmelic, 
etc., tests 

Scientific approach to psychological 
pioblems, 3, 74, 76, 78, 86. 1 "tO 
Scotland, test norms in, 68 
Secondary education, selection for, 
31, 71, 92, 119-20, 122, 124-5, 
154, 163, 164, 165-7, 195n., 197, 
198, 199, 205, 209, 223, 224, 228 
Semi-interquartile range (Q), 40, 45, 
48, 79 

Sensory-motor tests, 186 
Sex differences in ability, 2, 15-16, 75, 
76-81, 83-4, 157n. 
in cinema attendance, 89-90 
in milk-drinking, 84 
Simple structure, 141 
Skew distribution, see Frequency 
distribution 

Small sample techniques, 85 
Spatial judgment (k-factor), 143, 144, 
161 
tests of, 184 

Spearman-Brown prophecy formula, 
127, 201 

Speed factor in ability, 130, 132, 144, 
146, 163-4, 168, 169 
Spelling ability, factorial analysis of, 
134-6 

measurement of, 22, 25, 26 
tests of, 21, 135, 175, 234 
Spread of measures, see Dispersion 
Standard Deviation (S.D. or σ), 36, 40, 
45-52, 87 

differences between, 83 
of averaged measures, 123-4 
use of, in combining marks, 54-6, 
59-62, 65 

Standard Error (S.E.), 78, 87n., 88; 
see also Reliability 

Standardisation, see Mental tests 

Standard scores, 55-6, 59-62, 65, 67, 
69, 73, 115, 123, 156, 171, 194 

Standards of marking, see Marking of 
scholastic work 

Statistical methods, definition of, 1, 2-3 
limitations of, 2-3, 80, 138-9, 146-7 
objects of, 1-3 

Statistical significance, 79-80; see also 
Reliability 


t, 85, 87, 110 

Tabulated measures, average of, 40-5 
correlation between, 98-102 
percentiles of, 38 
standard deviation of, 47-8 
Tabulation of measures, 1, 3-7 
Television viewing, 105-7 
Temperament tests, 19, 126, 161, 186-7 
factor analysis of, 141, 144, 146 
Tetrad difference criterion, 142 
Time limits in testing, 20-1, 132, 156, 
163-4, 171, 190 

Training College students, abilities of, 
39-40, 75-6, 80, 86, 135, 136, 139, 
155, 249 

Trustworthiness of statistical results, 
see Reliability 
Two-factor theory, 141-5 


Unitary factor patterns, 141-2, 147, 
152 

United States, see American 
University education, selection of 
students for, 2, 12^ 135, 197, 198, 
208-10, 228, 247, 248-9, 252 

Validity, see Examinations, Mental tests 
Variability, see Dispersion, Reliability 
of attainments, 1-2, 17, 62-3, 129, 
131, 135, 200-2 

Variables, categoric, 88-90, 104-7, 108 
definition of, 3 

discrete and continuous, 3, 5, 50 
traits and abilities as, 19 
Variance, 87 

Verbal ability (v or v:ed factor), 143, 
144, 145, 168, 193; see also 
English ability 

Verbal tests, see Group intelligence 
tests 

Viva voce, see Examinations, oral 
Vocabulary, see Language development, 
Stanford-Binet, etc. 
Vocational abilities, factor analysis of, 
136-7, 145 

Vocational guidance, 144-5, 154, 163, 
170, 248, 251 

Vocational selection, 119-20, 125, 135, 
145n., 161, 210, 247-8 
Vocational tests, 55, 118-19, 125, 126, 
151, 154, 184-6, 189 

War Office Selection Boards, 248 
Weighting, see Marks and scores 

X-factor, 144 

z-techniques, 110-11 
ABILITIES 


New and Revised Edition 

Primarily written for teachers, clinic workers, school 
medical officers and psychology students, this book 
provides an elementary interpretation of the statistical 
techniques which are essential for a proper understanding 
of mental measurement and shows the application of these 
techniques to problems of testing and measurement in 
English and Scottish schools and universities. 

This new edition brings the book up-to-date in respect 
of selection for secondary education and presents, in 
the light of developments in the field of intelligence 
theory and testing, a more reasonable picture of the 
relations of innate and acquired abilities and of the 
findings of factorial analysis than has hitherto been 
available to education and psychology students. The 
comprehensive list of mental tests has been 
completely [illegible] the statistical [illegible] 
have been revised and expanded. 



